Like, you know, if something is being figured out in the background, then you might not notice, or it might again happen that the field reaches a peak, isn't able to deliver, people lose interest, and then you have to go through the whole cycle again. That's what gets called an AI winter: you go into an AI winter and then have to recover from it.

In 2006? Yeah, one of the landmark things was the layer-wise training schemes. There were these schemes for training deep architectures layer by layer, using either RBMs or autoencoders. Geoff favoured the RBM formulation, but either way the idea was that you could now train a deeper architecture by pre-training each layer, stacking them together, and then fine-tuning. If you talk to some people, they'll tell you they figured this out ten years earlier and had a paper on it, but until the deep belief net stage it hadn't been reproduced at that scale. Then other groups picked it up — people doing it with autoencoders, and I think Yann LeCun was doing something with sparsity as part of it. Anyway, you started getting more and more groups coming up with different recipes for this layer-wise pre-training, and they were successful. So it felt like something had been figured out, because before maybe there was one group that could do it and not many people reproducing it, and now every group came up with its own recipe and all of them worked. Usually when multiple people independently manage to make the same thing work, that's validation — it's a sign that something has happened and now everyone can do it. If there's just one group able to do something and no one else can reproduce it, there's usually something weird going on. But when it scales up and everyone can come up with their own recipe, it means something has changed in how we approach the problem, and we're at the point where we're starting to solve it.

So why did it still take six years to get to AlexNet? Well, the whole understanding was still in motion. Obviously no one knew exactly where this was going. It's not like in 2006 everyone knew that if we just did this one thing, everything would work out — if they had known that, they would have joined forces and just done it. The way AlexNet happened is — I hope I'm not misremembering things — Alex was, if I'm correct, the first author of the paper. He was a PhD student in Hinton's group; I don't think he was a master's student, I think he was a PhD student. His background wasn't necessarily machine learning — or rather, it doesn't really matter what his background was, but as a character he was more of a hacker type: he liked to write code, low-level stuff. And, in the background, there were multiple things happening. One of them was that GPUs were becoming popular because of games and whatnot.
And I think what happened is that Geoff just took this bet. Alex was the student, and Geoff basically told him: look, if you manage to improve the numbers on ImageNet by a few percentage points, you're going to get your PhD. You don't need to worry about the standard research path, just get those numbers up. And the thing about AlexNet is that there's essentially no novelty in the paper itself — all of the methods in it were around in earlier papers. What really made it work was some specialized GPU kernels that made it possible to train the model across multiple GPUs, and that was not easy to do at the time. If I remember correctly, it got to the point where Alex was interacting with engineers at NVIDIA, because the libraries NVIDIA had weren't meant for machine learning — they weren't well documented, and they didn't let him do the things he wanted to do. So he got down to that low level, talking to these people, saying "I need this primitive" and so on. That's how AlexNet happened. And it worked much better than everyone expected: what Geoff was hoping for, as far as I know, was to be roughly on par with the previous methods, and the outcome was that it was way better. I forget the exact difference, but it was a significant jump in performance. To the point that, in the years right after AlexNet, a lot of computer vision groups tried to reach that kind of performance with traditional methods and couldn't. That's when they realised: SIFT and all of these other tricks we've been doing are never going to get us there. And that's why the change happened, right? So you had this paper with a big number, and a bunch of groups trying to beat that number; they couldn't, they were very far away. Then there was Andrew Zisserman, who is a very big name in the computer vision community, someone everyone looks up to. His group switched to neural networks and started making them work, and at that point — it's one thing to have a really good result coming from an outsider, someone you maybe don't trust because they're not part of the computer vision community, they don't publish in the computer vision venues — but when the big names in the field start using these techniques and getting the same kind of results, then everyone says: okay, this is the thing we should be doing. So I think that's how it happened. I guess it was a mix of Geoff Hinton being eccentric and doing these kinds of things, and Alex being the right person at the right time. He's kind of left the field, as far as I know. After AlexNet he didn't stay in it, because he was never really into this stuff — he did it because, well, he wanted the PhD, essentially. The paper was really carried forward by the other folks around it, who were more the research audience. But yeah, that's roughly the story.

— And what about graph networks? I don't know when those came about, but I don't think they were there in 2016. — Oh, yeah, yeah — you mean geometric deep learning, graph networks, right? Where did those come from?
Oh, okay, so this came about much later. Technically the first kind of geometric deep learning paper is from 2009 or earlier, by some Italian researchers. That paper was largely ignored; no one paid attention. I think it's around 2016 that the field actually took off. There were a bunch of papers around then: there were graph convolutional networks, there were the interaction networks that we did, there was a paper called graph networks, I think, or something like that. That's when it took off. I think it was largely a question of the right timing. When the 2009 paper came out, the timing wasn't right, because deep learning itself wasn't established yet. And the original graph neural network formulation is a weird mixture: it doesn't really look like a standard neural network, because you have these graph factors, this almost non-parametric structure over the graph. So it was weird for people who were not doing neural networks, and weird for people who were doing neural networks, because it mixes both, and few people spoke both languages. The timing wasn't right, and also, after ImageNet, a lot of the effort in the field went into the more typical topics, like computer vision and language. Around 2016, 2017, geometric deep learning took off, and it was more about people branching out into other topics, where graph structures were the predominant form of the data. Before that, I think there were just too many problems still to be solved in computer vision and language, and the community wasn't big enough to cover all of these domains. So that's roughly how it unfolded.

— And do you think there are directions that don't get explored because the community focuses on whatever is hyped? — Yeah, definitely. As a community, we tend to focus on whatever is hyped at the moment, and right now most of the research is around neural networks. There are certain aspects — certain reformulations of the learning problem itself — that get much less attention. Okay, let me give you an example, but we're going into hypothetical territory now; I don't necessarily have hard evidence for this. We talked about generalisation, out-of-distribution behaviour, this idea of learning the actual task — and I might update my view on this later. So, there is a subfield, a bit of a niche branch, about algorithmic imitation. The idea is: I have an algorithm that I know — say sorting — I know how to write it down, and we understand it. And I have a neural network. What I want to do is give examples to that neural network so it learns to solve the task, but in such a way that I can actually read the algorithm back out — the goal is to make sure the network actually learns the sort, right? To the point where I can identify, in the neural network, the computation it performs, and I can guarantee that what it's doing is some form of sorting. So this is the space of algorithmic alignment, algorithmic reasoning, and so forth.
And I think the high-level outcome — and there are papers about this result — is that you can't do that. Unless you give it essentially all possible examples, then even though the neural network has the capacity to represent the working algorithm, it will never be discovered through learning. One way people phrase this is that the architecture itself is not sufficiently aligned to the algorithm — whatever that means; this is a bit of hand-waving. But there are some aspects of it that I personally do think are important. For example, an algorithm usually has sparse access to its data — sparse operations — while neural networks are very dense. In an algorithm you have variables, and you access those variables sparsely; you're not touching every variable all the time. In a neural network, the equivalent of a variable is, let's say, the hidden state, and every time you do a step of computation you use all of your hidden space — all of your hidden state changes. There's no sparse access where you update your hidden activations locally, in one region; everything changes, everywhere. Whereas algorithms are usually very local in how they operate. So that comes up as one line of research. The other thing is that algorithms often rely on discrete structures, while in a neural network everything has to be continuous. And some of these things are really baked into the architecture: maybe the global-versus-local issue you can work around, but everything being continuous is very hard to get around. There are obviously people working on discrete variables in neural networks and so forth, but I can tell you it doesn't really work yet, or it works only in restricted settings. So this could be a fundamental limitation, which would mean that at some point in the future we might need to change something drastic about neural networks in order to have these discrete structures, or to have local updates, local interactions. And there is the question of why we're not exploring that. Well, first of all, it's a bit off the beaten path, and the current thing works. So there's this temptation of: maybe we just need to scale things up more and it will work — maybe you don't actually need the discrete structures; just because we don't know how to find a fully continuous solution doesn't mean there isn't one, maybe we just need to search harder. So, exactly as you're saying, people tend to ignore these things, partly because doing discrete structures means drifting into the non-parametric world, and people believe non-parametric methods don't work. I mean, there are some folks interested in creating mixtures between neural networks and non-parametric methods — Gaussian processes and other kinds of things — so there is some interest, but not much. People usually don't like it; they'd rather keep things as they are.
And we might just keep scaling until we reach a point where we really can't do certain things, and then go into an AI winter, and only then will the community be able to say: you know what, let's refocus — not everything has to be continuous, not everything has to be done this one way. But I think it's a very natural way for the field to be arranged, right? It comes down even to the publication process. Reviews are largely done by PhD students who are just now learning about all the vision and language stuff everyone is excited about, and when they get a paper that goes outside their comfort zone and starts doing some weird stuff, they have a tendency to reject it, because that's the easy thing to do. They'll have a hard time judging it; they don't have the context of where the ideas are coming from or what the paper is trying to do. So it's all part of that. It's true, for example, that in the early days people doing neural networks were barely able to publish papers: if you had a paper with neural networks in the title or in the abstract, it would usually get rejected — reviewers basically wouldn't read the paper properly, they'd just say "this doesn't work." At some point things got renamed to deep learning, because now you're not doing neural networks, you're doing deep learning, which is a "different" thing. So you can't ignore the impact of all of this. At least in academia, you can't survive unless you publish, and to publish you need to follow, more or less, what is being published. You can do something off the beaten path on the side, but it can't be your main focus — unless you're someone who is quite established and can afford to do whatever you want. You see that pattern with very senior people: they don't care about papers, they don't care about what other people think, they just do whatever they believe is promising.

— A question: when a neural network is applied to a problem and it doesn't work, is that usually something specific to the problem, or to the model, or would you need, say, a dedicated model per application? — Okay, so generally the models are meant not to be problem-specific, right? The big selling point of deep learning is that we have one architecture and it works everywhere — that's at least the direction everything is going at the moment. I don't know the particular case you have in mind; I'd imagine it worked, but at a very small scale, because you'd have to train new models for that setting, and there are only so many ways to do that. And in my mind, what matters most there is the scale, because, for better or worse, we're now in a space where if you can't run things at sufficient scale, basically no one will use it.
I'm not sure I'm answering your question — I'm actually not sure what the question exactly was — but there's definitely work on quantisation, right? Even the LLMs we use nowadays, there are ways to reduce them to eight bits or four bits or something like that. But I don't think that's quite the kind of discrete structure I'm thinking of. Those discrete structures usually come from a relaxation of a continuous thing. The way it works in practice is that you have a full-precision model — and actually, the more precision you have during training, the better — you train it that way, and then you have a whole process, a kind of quantisation process, where you discretise the model and move it into the discrete space. But you rarely learn directly with discrete weights; if you try to learn directly with them, it's very hard. So usually you take a pre-trained model and quantise it afterwards. What I'm thinking of is not just having discrete weights. It's more that, if you think about how an algorithm works — where you have, say, a tree or a list or things like that — neural networks don't really work that way. They don't create internal data structures and manipulate them in different ways, and whatever internal structure they do build is not discrete in nature, the way it would be in an ordinary program. So I think there's something about the internal state always being continuous — everything being projected into a continuous space, everything becoming soft. We'll touch on some of this later in the course; there are lines of work that try to address it. I also have my own pet hypothesis about why I think this limits how these networks generalise, and if there's time I can walk you through my reasoning, but it's a bit early to justify it now.

Okay, so this slide is basically saying the same thing the previous slides were saying, so maybe I don't need to stay on it too much. It just shows the winters and the boom periods: you have symbolic AI in the 40s through the 60s, then you have the expert systems, and now we're technically in the machine learning boom, and there's that question hanging in the air: is this it, or is there going to be another winter? To be fair — again, I'm no expert on this — a lot of people are now making a lot of money from it, and I think once you reach the point where real money is being made, it's probably harder to fall into a winter.
You can still go into a winter in terms of innovation, but at least there is some technology now that isn't going to go away, because, well, this is now how we do X — and X is a kind of service that people buy and need. So in that sense we've at least reached a stable point where this is actually used in industry on a regular basis to solve things — maybe not the flashy things; I'm not talking about the chatbots, those are the special cases — but the more pragmatic things companies do, where even the way the neural networks are used might look very boring to the research community. But those are established ways of doing very specific things, and they're not going to go away. So I think there's still the question of whether we hit a plateau, particularly from a research perspective, but at least we're at the point where — whether machine learning or deep learning counts as "AI" is very debatable — deep learning, which is something very specific, is a technology that is genuinely used in industry. In that sense it's not going to disappear; there's no point in replacing something that's doing the job and doing it well. So we're at least over that hump. Whether the same holds for the more ambitious stuff — maybe not, or at least not yet.

Okay, so we're going to start getting into somewhat more technical material, and from here on it's going to be a lot more technical, though I'm happy to contextualise things — so feel free to stop me at any point if something isn't clear. To start laying the groundwork for the things we're going to discuss in the course: we're going to focus a lot on neural networks, deep learning, and supervised learning, but there are lots of other things, many of which are not covered here. And I wanted to point out that there are different ways of categorising machine learning: by the kind of model you use — parametric or not — or by the learning regime: supervised learning, reinforcement learning, self-supervised learning, unsupervised learning, transfer learning, and so on. We're going to cover some of the learning regimes, because I think that's actually quite useful. And we'll spend time on linear models because, very typically, if you want to build theory about neural networks you end up analysing linear models, since the maths is tractable, and then you hope the conclusions carry over to neural networks; we're going to go through some examples of that.

Okay, so — I don't know how this will go; I've met many of you already and you probably know quite a bit of machine learning, but anyway, I decided to start with something very basic. Suppose someone gives you this data: you have three squares, a few purple triangles, and then you have this circle, and they ask you, is this a square or a triangle? How would you answer that question? Does anyone have an idea of how you would go about it? Okay. Yeah.
— You could find the group that the circle is closest to. — Right, and based on that you'd probably say it's a square. Yeah. Anyone else, or should I just go on? I think that's a very natural way of solving it; I would have answered the same way. And it's actually what's on my slide, because it's exactly that kind of solution. This is K nearest neighbours, and the idea is: you decide on some distance metric — whatever it is; here Euclidean distance seems to make sense, and you'd have to decide this again for any different type of problem. You compute the distance between the query and every other point: you have your query, you have all the other points, you compute the distances, you sort by distance, then you look at the K points closest to the query — the 5 closest, say — and you look at what class they have. Are they squares? Are they triangles? Then you predict the majority class, because most of them are squares in this example.

Okay, so what is problematic, if anything, about this kind of approach? And note, this is machine learning, right? What you described is a learning algorithm. You're not saying "well, it has four sides and right angles, so it's a square" — that would be the prescriptive, rule-based way of solving the problem. This is really learning from data: I don't know what makes something a square or a triangle; what I'm going to do is just look at my training set, look at the neighbours, and make a decision based on that. So this is the learning algorithm, and it does solve this problem, but — can anyone think of things that could go wrong? What would happen if you applied this at a much larger scale — say, classifying faces in images, or something like that? How would you even do that? What would be the first thing you'd struggle with? You get images of faces: what's the first practical difficulty?

— I think when you try to scale to larger datasets you just hit computational issues. — Yeah, that's one big issue, and I'm actually going to use exactly that issue in a couple of slides to motivate the parametric methods. So one issue is that this is a non-parametric method, and one way to tell parametric from non-parametric — you didn't quite get there, but I want to emphasise it — is that here, every time I make a query, every time I try to classify a new point, I have to use the entire dataset. I have to do something with every data point to come up with an answer. And that's problematic: if I have a billion data points, I need to go over a billion points every time I make a decision, and that's not going to work. That's not what you want. You want the cost of a prediction to be fixed — a fixed amount of computation to make your decision — not something that scales with the dataset. So that's definitely one problem. Any others? Yes. — It's very sensitive to noise. — Right.
— So, what do you mean by noise? — I mean, if one point is off, it will really affect the prediction, the decision we end up with. — Right. If you increase K, you become a bit more robust — that's the point of K — but yes, if you do top-K with K equal to one and there's an outlier right next to your query, you're doomed, that's it. I agree that's a problem. Technically, by increasing the value of K you make the system a bit more robust, because you're not relying on a single neighbour.

— It's also not invariant to the units. If I measure one axis in kilometres and then measure the same thing in metres, the Euclidean distances change. — Yeah, that breaks it: we'd get a different answer just by measuring distances in kilometres versus metres. Definitely. And I would go even further — that was the first thing on my mind too: defining the distance itself is the hard part. You're solving half the problem just by picking the distance metric — maybe more than fifty percent of the problem. If I go to this dataset of faces, what does Euclidean distance in pixel space between two images even mean? It's not just the units; it's simply the wrong distance metric. You'll find many images that are not faces that are closer to an image of a face than other images that do contain faces, because it's the wrong distance. So the distance is really the problematic part: this method lives or dies by the distance metric. If you somehow don't have the right distance, it's not going to work. And every time you change some property of the data, you need to re-evaluate what the right distance metric is, because it might need to change.

I wanted to present one more idea, just to cover a bit more ground, and this one is about making things more probabilistic. Another issue with the way we framed this — and it leads into the next idea — is that kNN, as described, doesn't really have a sense of uncertainty; it's harder to talk about how uncertain the prediction is. One way to add a notion of uncertainty — there are many — the simplest, on top of the kNN algorithm, is to look at your K nearest neighbours and count how many of them belong to one class versus the other; from that you can get some kind of probability. If it's roughly fifty-fifty, that means you're uncertain — it could be either a square or a triangle — and the more skewed the counts are towards squares, the more certain you become. But that still misses a different aspect, which is the distances themselves: maybe the majority of my K neighbours are squares, but they're the furthest away, and the closest neighbours are actually all triangles, even though there are just two of them. In that case you might want to say: I'm not so sure I should call this a square, because the distance itself matters, right?
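To make the procedure concrete, here is a minimal sketch of the kNN classifier as just described — a toy NumPy version with my own function names and made-up toy data, not anything from the slides. The standardisation step is one simple way to address the kilometres-versus-metres issue raised above.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest training points.

    Note: every query touches the whole training set, which is exactly the
    scaling problem discussed above.
    """
    # Standardise features so the distance is not dominated by units
    # (kilometres vs metres would otherwise give different answers).
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
    Xs = (X_train - mu) / sigma
    qs = (x_query - mu) / sigma

    # Euclidean distance from the query to every training point.
    dists = np.linalg.norm(Xs - qs, axis=1)

    # Sort by distance, keep the k closest, take the majority class.
    nearest = y_train[np.argsort(dists)[:k]]
    return Counter(nearest.tolist()).most_common(1)[0][0]

# Toy data: 0 = square, 1 = triangle.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9]])
y = np.array([0, 0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.15, 0.2]), k=3))  # -> 0 (square)
```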
And I think Parzen windows go in that direction — at least that's how I remember learning about them a long time ago. Here you're not directly trying to do the classification; first of all, you're just modelling the data. This is basically a way of estimating the density of the data. What you do is: at each training point you put some kind of kernel, just a little bump, and the sum of those gives you a sense of the data distribution. You can see it here: because these kernels overlap and the density in that region is higher, points in that region are a lot more likely to belong to that particular class. And obviously the issue with these square kernels — simple box windows — is that you get areas with zero probability, which may not be what you want; you can soften that by using Gaussians, which is what people do in practice. Gaussians have infinite support, so you have information about any point in the space, even though the density becomes vanishingly small far away. With this, you can compute the probability of your query being a square, given the dataset, by building this density from all the data points that are squares, then the density from the ones that are triangles, and then comparing the two — and now you can reason about the decision with a notion of confidence attached.

— What are we trying to get at here? A measure of the confidence in the answer? — Yeah, exactly. It gives you an uncertainty and a better way of reasoning about the data. That's why I brought it up: I was trying to recast the same thing in a more probabilistic setup. And this is also kind of the basis on which people build Gaussian processes and all the other things I'm not covering.

— What's going on in this formula? — So this is your kernel: it's evaluated at your query position minus the training point, so it depends on how far apart they are — it's essentially a distance. In practice this kernel becomes the Gaussian in the picture. It's really just a fancier way of looking at the distances between the query and each point and combining them, like a weighted sum according to the distances. — And what it's trying to ensure, by making each of these kernels a proper density, is that the result is also a proper density? — Yes: everything sums to one and so on. I have these, say, seven points, each point gets its own Gaussian, and I want an overall density, so what I'm doing is summing those Gaussians and then renormalising, so that the result is actually a proper distribution. I mean, this is not the main focus of the course; I just wanted to give you a flavour of other ways of solving these problems.
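As a rough sketch of the Parzen-window idea — a Gaussian bump on every training point of a class, summed into a per-class density and compared at the query — here is a toy version. The function names, toy data, and the equal-class-prior assumption are mine, and the bandwidth h is the kernel-width hyperparameter discussed next.

```python
import numpy as np

def gaussian_kernel(u):
    """Isotropic standard Gaussian bump (normalised so the density integrates to 1)."""
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2 * np.pi) ** (d / 2)

def kde(X_class, x_query, h=0.5):
    """Parzen-window density at x_query: average of kernels centred on each
    training point of one class, scaled by the bandwidth h."""
    u = (x_query - X_class) / h                # scaled offsets to each point
    return gaussian_kernel(u).mean() / h ** X_class.shape[1]

def predict_proba(X, y, x_query, h=0.5):
    """Compare per-class densities to get a soft class probability
    (assumes equal class priors, just to keep the sketch short)."""
    p_sq = kde(X[y == 0], x_query, h)          # density under the 'square' class
    p_tr = kde(X[y == 1], x_query, h)          # density under the 'triangle' class
    return p_sq / (p_sq + p_tr)                # P(square | x)

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9]])
y = np.array([0, 0, 0, 1, 1])
print(predict_proba(X, y, np.array([0.15, 0.2]), h=0.5))  # close to 1.0
```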
And — to me at least — these kinds of techniques are a bit more intuitive: you can reason about what they're doing and why. Whereas, as you move towards neural networks, things become less intuitive; it starts becoming more and more of a black box. You have the black box, you have the data, you have the optimisation, you deploy — you push data through the black box and get something out at inference time, and it's harder to understand what's going on inside. So I wanted to start with these two or three examples that everyone can reason about, and then maybe later we can try to recover some of that intuition for neural networks as well.

Obviously, this is problematic in the same ways kNN is problematic. Here the analogue of choosing K is the size of the kernel, the bandwidth. If I make the kernels too narrow, I'm going to do very well on the training set, because I have a bump exactly at each training point, but I'll have large regions of very low density where it's going to be very hard to reason. That's what this picture is trying to show: imagine little rectangles around each training point — if they're too small, you'll do very well on training (technically, you're overfitting the training data), but when you have to make predictions you won't make good ones, because there will be many regions where you're very uncertain about everything. If you make the kernels too wide, you start having issues even on your training data, because mass leaks in from neighbouring points of the other class. So I'm basically saying that, as with everything, there are hyperparameters — here it's selecting the kernel and its width — and depending on how you select them, the method might or might not work on whatever task you have. Okay, hopefully that makes sense.

So the methods I've been describing are part of the non-parametric family, and they're non-parametric because you always have to work with the data: you have this dataset that describes your problem, and you always have to go through it to make decisions. And — interesting at least to me — the non-parametric scenario feels like search: you always have to do this search over your data to produce a decision. These methods can be quite intuitive, which is the part I like, but they're usually seen as... well, they're not as much in favour. There are places that really like this stuff — groups in Cambridge doing Gaussian processes and so on — but by and large they're not seen as the driving force of machine learning. The reason is that they don't scale, exactly because of the point that was brought up: you have to go through all of your data, and that's not going to work. So you need something else. The alternative to this setup is parametric methods, and those are the ones dominating the field at the moment.
And the relationship between the two is complicated, in the sense that there's lots of work showing, for example, that neural networks at a certain scale effectively behave like non-parametric methods, and there's a lot of work on trying to combine them. Let me put it this way: they have very different properties. It's not like parametric methods are a superset of non-parametric ones, or that parametric methods can do everything non-parametric methods can and there's no point in having both. There are aspects of non-parametric methods — for instance, there's no gradient descent, no explicit training loop, in the kind of method I described — that parametric methods just don't have. So they have quite complementary properties, and even though it's only happening in a limited way at the moment, I do think there's a point to merging the two into one pipeline, because each can do things the other can't. In my view this is an area of research that's probably going to take off more and more in the coming years: mixing the two.

— What kinds of problems would you use that for? — The typical way people do it right now is to use the parametric method to compute some kind of embedding, and then put a non-parametric method on top of it. So you use your parametric method to project from your input space to some other space that is better behaved, where it's easier to define distance metrics and whatnot, and then you do the non-parametric part in that other space. That's the most common pattern. I think more can be done; there are more interesting combinations. Let me give you another example — this one is more debatable, and I don't know how near-term it is — if you think about the Transformer, some people argue that Transformers are already somewhat non-parametric in flavour, because the attention acts almost like a non-parametric component: whenever you have a query, it attends to the context by computing similarities with every element in it. It's not exactly that, but you can take this perspective and push it further, and say: let's make the attention layer a proper non-parametric component. You could go that way. I don't know how useful it would be, because the hard part is that you have to propagate gradients through that search process — you have to do it in such a way that there's still a learning signal for the layers below. But that's an example where you'd be combining these things more tightly than just stacking one on top of the other, which is the usual way of doing it. That's the kind of place I think people could go.

— Sorry, just to make sure I understand: where do most people draw the boundary between parametric and non-parametric here? The attention has weights — the key and query projections — so isn't that parametric?
Okay, so for a non-parametric method, one property you'd expect is that if you increase the size of your context — or whatever set you're operating over — the method still applies. If you had a separate weight for every element, then for a new element you'd need a new weight, and it wouldn't transfer; the attention layer doesn't have that problem. So, okay, it depends on where you draw the boundary. The key and query projections are the same matrices applied to every embedding — the same transformation wherever the element sits. One way to think about it is that the non-parametric component is really just doing dot products everywhere: forget about the projections into key space and query space, those are just feature projections — you can fold them into the layer below, they're not part of the non-parametric component. The mechanism of attention itself is computing dot-product similarities with everything, using a softmax to normalise them, and then averaging based on those weights. Instead of a hard top-K — "give me the three best matches and take a majority vote or whatnot" — you use a softmax, because you want to be able to propagate gradients. So it's not that sharp a distinction; I'm not trying to draw a clean boundary, because it's genuinely hard to draw one.

I described it this way because I feel it's the more traditional way of carving up the field. There definitely is, in the field, this view — there are whole books on non-parametric approaches to machine learning, and there are groups that do parametric methods — that these are distinct things. And at the moment the common intuition is that parametric methods work much better, so basically everyone focuses on parametric methods, specifically neural networks. That's why I presented the distinction this way. In practice — maybe just as a high-level remark, if you want to do research and play with these methods — most research ideas really come from someone who is familiar with two or three different subfields and basically takes an idea from one subfield and applies it to another. That's by far the most common pattern, which isn't to say there's no creativity involved. So in that sense, and this is why I keep trying to draw parallels and will draw more as we go: don't box these into separate things and say, this trick is only for parametric methods, that one only for non-parametric. Usually, when you want to invent something, it's really about looking at both and saying: they do this trick over there, maybe I can do it here as well. We happened to design things one way or the other, and that's how you get new ideas and new methods. But yeah, the point is not to draw a hard boundary.
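Here is a small sketch of the "attention as a softened nearest-neighbour lookup" view described above, contrasted with a hard top-k readout. Shapes, names, and toy data are mine; real transformer attention of course adds the learned key/query/value projections, which, as said above, you can think of as folded into the layer below.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_attention(q, K, V):
    """Dot-product attention: compare the query with every key, softmax the
    similarities, and return the weighted average of the values. The softmax
    keeps everything differentiable, so gradients flow to the layers below."""
    scores = K @ q / np.sqrt(q.shape[-1])     # similarity to every element
    w = softmax(scores)
    return w @ V

def hard_topk_readout(q, K, V, k=3):
    """A 'properly non-parametric' variant: keep only the k best matches and
    average them. The selection step is discrete and non-differentiable,
    which is exactly the gradient-propagation difficulty mentioned above."""
    scores = K @ q
    idx = np.argsort(scores)[-k:]
    return V[idx].mean(axis=0)

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(10, 4)), rng.normal(size=(10, 4)), rng.normal(size=4)
print(soft_attention(q, K, V))
print(hard_topk_readout(q, K, V, k=3))
```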
I mean, you can try to draw boundaries, but that's not really the point; the point is just to be aware of where the boundaries are, and that they're fuzzy.

— In your view, what matters more for whether a parametric or non-parametric model does well: the features, the data, the structure of the model? — It's hard to say; they all matter. For example, if you have a single-layer model, there are a lot of things it will not be able to do no matter what features you give it. So I would argue they're roughly on the same level — it depends a lot on the domain. If you have a very specific question, a very specific scenario, maybe it's easier to judge and say: the driving force behind the performance here is this or that. But in general they all matter. If you have good features, good priors, for a non-parametric method, then it can be competitive. So I would say they matter about equally. I'd also say the typical scenario is that people first pick where they stand: in most cases it's "okay, I'm using neural networks," and then you're not even asking this kind of question, because you're already committed to parametric methods. The question becomes: is it a layered architecture combining, I don't know, MLPs and attention, transformer-style, or do I combine some convolutional layers, or something else? You're not asking the parametric-versus-non-parametric question at all, because you're in a much narrower regime and you're just tuning within it. So — they can all have an impact; I think they matter about the same.

Okay, so this is still part of the intro. So what are these methods about? This is what we're doing: we'll go through the basic material, and the whole point is that if we go through the basics, then when we get to the more advanced material there's more context to understand it and much more to build on. Maybe taking a step back: what are we doing here? We're trying to find a good predictor. What is a good predictor? It's a function where, if I give it an input — this is supervised learning — I give it some input x and I expect to get an output y. That y might be a class — that's classification: output "square" or "triangle" — or it could be a regression target, where for a given object I predict some real number. That's the difference between classification and regression; in both cases we want a good predictor. And you have some data: examples, which are pairs of inputs and the outputs you expect. What you're really looking for is the meta-algorithm — the learning algorithm — that, given the data, gives me a function h. That's what you want. And in the non-parametric scenario, the way I'd describe it, this meta-algorithm is defined on the power set of all possible datasets — observations and their labels.
And what this particular formulation is trying to say is that the method you get out of a non-parametric approach is defined over that power set. Maybe I shouldn't have used h for both: this is the machine learning algorithm, the meta level, and what you get out of it is the inference procedure you actually run, going from your dataset to a prediction. The whole point is that whenever you want to make a prediction, you need your dataset of examples, and your prediction — your inference — is really some form of search procedure that goes over that entire set and makes a decision. Whether it's looking for the closest neighbours or doing something slightly different, some process of that kind, is secondary; at its core, that's how it works. In the parametric scenario, what you're doing is replacing the dataset with parameters. What you can think of it as is: you're trying to compress your dataset into these parameters θ. That's essentially the process going on. So instead of doing this search over the entire dataset, you have your parameters — you've compressed the dataset into them — and the advantage is that the dataset may be very big, but θ has a fixed size, so the cost is fixed: you know exactly the cost you pay when making a prediction. An example of a parametric model could be a linear function: you take x, multiply it by some weights W, and that's your output.

— Yes? Do all non-parametric methods have to go over the entire dataset at inference time? — Most of the basic ones do, but you can do all kinds of tricks so that it's not strictly necessary. There are different ways to answer that. You don't necessarily have to go through the entire dataset to do your search: you can, for example, build a data structure over your data — an index of your neighbours — that makes the search much more efficient, logarithmic or whatnot, because you're no longer actually looking at all the examples; the data structure lets you navigate the space, so if you know some data points are going to be very far from what you're looking for, there's no point in checking them and computing distances to them. So you can definitely do all kinds of tricks to make this fast; it doesn't have to be the extremely slow thing I'm selling to you. You can also decide to throw data out: if some data points are already covered, in one form or another, by others you keep, there's no point holding on to them, for your memory or for the cost of your forward pass. All of these things exist and can make things much more efficient. And there are sufficiently many people in the community who would argue that non-parametric methods are really parametric in disguise, and vice versa. So it's not that I'm trying to... by the way, most of the time I'm going to paint pictures that are a bit exaggerated, so don't believe everything I say. Right now I'm painting this picture where you have these very sluggish non-parametric methods, and then much better parametric methods, and everyone is on the parametric side and thinks that's the best.
That's not really the case. There are particular scenarios and places where non-parametric methods are much better. They're very useful in low-data regimes, for example, because it's much harder to do this process of extracting an optimal θ when you have, say, a hundred examples of the thing you're trying to classify. So there are scenarios where very well-thought-out non-parametric methods actually win. There are many places where non-parametric methods are still very much appreciated, and there's a huge literature; as I said, there are whole groups actively working on non-parametric methods. It's not like some other topics that are really niche — if I wanted to do, I don't know, symbolic learning or something like that, it would be hard to find many groups very active in that space. For non-parametric machine learning you'll easily find people — the places I mentioned, and plenty of others doing all kinds of things. So they're not as unpopular as I'm making them sound. But it is parametric methods that are state of the art in most domains — at least when you're in a large-data regime, when data isn't the bottleneck, it's more or less parametric methods that perform best. Non-parametric methods can perform very well in those regimes too, especially if you have auxiliary data you can pre-train a model on and then only a couple of examples on your target task, which is enough because of the pre-training — you get these pipelines where you pre-train and then do the few-shot, non-parametric part on top. Anyway, I was just trying to set up a picture that makes sense and is easy to follow.

Okay, so this is the parametric model. The idea here — this is the basics, and from here onwards we're going to keep building on it, because that's what most of the course is about — is that we're looking for a meta-algorithm that will help us find this function h, which maps from θ, my parameters, and an observation x, to some prediction y. h itself — the structure of this function — defines the model family... okay, that's barely proper English: the model family, the architecture. One example is a linear model; another is a neural network, like a transformer or an MLP or whatnot. And in machine learning, when we say we have learned this model h, what we mean is that we have found some θ* such that, when I evaluate h at θ* on some data point, it gives me an answer that is correct, or as close to correct as possible. So there are two stages: the first is the architecture, the model family; the second is the learning, which is basically the search process that gives you θ*, your optimal parameters. And in the field, when you set these things up, there are two different concerns. One is expressivity.
When we talk about expressivity, you're worried about picking an architecture — and this usually comes down to deciding where this θ lives: how many degrees of freedom you have, how they're arranged in the architecture, what the structure of the network is, and so forth — in such a way that the model family is expressive enough to represent whatever you're trying to do. Expressivity is basically the diversity of functions you can get by changing θ, and you want sufficient diversity so that the function you're looking for is inside that set. That's expressivity. The other side of things is learnability. Learnability is about: given this architecture, and the search procedure we use in practice — which will be gradient descent — how likely are we to actually find this θ*? That's a separate problem. Usually we use stochastic gradient descent, and then the question is how likely gradient descent is to discover this θ* from the data you have. These are two sides of the same coin — the two things you worry about when you set up your model. A lot of the time they're treated separately: you'll find a lot of work focusing just on expressivity, asking questions like "can architecture X solve task Y?" You have a function family and a task, and usually you approach expressivity from the negative side: what you try to show is that there is no configuration of parameters, for the architecture you're considering, that solves the task. That's the typical avenue for proving these things, and if you focus only on expressivity you can use a somewhat different set of tools, which makes it easier. The other side is learnability; there's usually not as much work there — a comparatively small set of papers looks at learnability directly — but the two interact. You can have a very expressive family of models that can express anything you want, but where learning is really hard, and because of that you will never actually discover the solutions. Conversely, you can have models that are very easy to learn but not expressive, so they cannot represent the thing you want.

The plan moving forward is: we're first going to ask what a good parameterisation is — by which I basically mean, what is a good architecture, a good choice of model family — and then we're going to ask how we find this θ*. And the answers to both questions are going to come back to what priors, what biases, what structure you build in. We're going to start with a linear model, because it's a bit easier to understand; we'll play a little with the linear model to see how this works, and from there we'll move on to neural networks — why you would want to go from linear models to neural networks, how they differ, and so forth. So, a linear model is basically a line. This slide I borrowed from Antonio's course on optimisation, but a linear model is actually quite easy to reason about. So how does this work?
You have your data x, a vector (or a matrix of data points) with D dimensions, the features, and you have some target y, a real number, one for each of these data points. When you want to build the model, and I am partly repeating what I already said, you are looking for this f_W(x), which is really just W transpose x (this is just setting up notation), and the output that you produce, this y-hat, is really just a weighted sum of the features. That is the whole model. So the question we are trying to answer now is: how do you find the W for this? If y is a real number, this is regression; we will go back to classification later, because classification is where we started and we will talk about doing classification with this kind of model, but for now we do regression. So assume y is a real number. Then for each data point there is some linear relationship between x and y, and you are trying to discover it. (By the way, how many of you are quite familiar with this material, so I know how important it is to go through it? How many of you have seen linear regression and so on? Right, OK.) In higher dimensions the line becomes a hyperplane; we can come back to this if anything is surprising, but I imagine it is not a surprise here. OK. Then there is the question of biases. There is a bias term b as well, but when you derive everything it is very annoying to carry it around, so usually you fold it into the weights, which you do by augmenting your features with a constant 1; then the bias just becomes one more entry of W. That is more notation than anything else, but by keeping everything as W transpose x you simplify your equations a lot, so you can always say, without loss of generality, that you drop the bias and move on. OK, so this is what you want to do; how do you actually get W? One way is to minimise the distance between predictions and targets, and here we use the squared error, which in some sense is the most natural choice when your target is a real number: you look for the W that makes this loss as small as possible. How do you do this? Well, assume y is approximately X W. Multiply both sides by X transpose, so you get X transpose X W equals X transpose y. Now this matrix X transpose X you can invert, and you get your solution, which is just W-star equals (X transpose X) inverse X transpose y; this is the pseudo-inverse solution. There are basically two ways of getting there. This is one of them, and it is just algebraic manipulation, multiplying things through. The other is to observe that your loss is quadratic, and you know how to solve a quadratic problem: take the derivative, set the gradient to zero, and solve.
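Here is a minimal numpy version of what was just described, on a synthetic dataset I made up: fold the bias into the weights by appending a 1 to each feature vector, then solve the normal equations W* = (X^T X)^{-1} X^T y (using a solver rather than an explicit inverse, for numerical stability).

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
true_W, true_b = np.array([2.0, -1.0, 0.5]), 0.7
y = X @ true_W + true_b + 0.01 * rng.normal(size=N)

# Fold the bias into the weights by augmenting the features with a constant 1.
X_aug = np.hstack([X, np.ones((N, 1))])

# Normal equations: (X^T X) W = X^T y  =>  W* = (X^T X)^{-1} X^T y
W_star = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(W_star)   # last entry is the recovered bias, close to 0.7
```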
And indeed, when you take the derivative and set it to zero, you open the brackets, move the terms to the other side, and you get back exactly the same formula that we had. All right, let me go to the next part. So far we have talked about linear regression and I showed you how solving linear regression works in closed form, which you probably already knew. The next step is the probabilistic interpretation. The reason for this is that we want to move to classification, and, more broadly, this is a general point: you can do a lot of deep learning without necessarily caring about probabilities, but if you frame things in a probabilistic framework it is very helpful in many scenarios and it lets you answer all kinds of interesting questions that are much harder to even pose, let alone understand, if you are not in a probabilistic setting. Also, for lots of people, framing things in a probabilistic or statistical manner is simply more natural; I am one of those people, I like it, though I realise not everyone is equally comfortable with statistics and probability. It is something worth investing in, because when it comes to the methods we will discuss, statistical thinking is a big part of it. OK, so you have a linear model; I assume many of you already know this, but raise your hand if anything here is not what you would expect. One naive way to take this linear model and turn it into a classifier is to simply use the line as the decision boundary: you classify a point as class A or class B based on whether the value of W transpose x is smaller or larger than zero. That would be one way of doing it, and it works, but you can make this probabilistic in a much more useful way. (How many of you know logistic regression? I assume I can go relatively fast through this part as well; maybe I should have gauged the audience more carefully in advance.) The different way of doing this, if you think about it from a more probabilistic place (again, I borrowed some slides from Antonio's course), is to introduce the sigmoid, and what you end up getting out of it is logistic regression. That is the basic version. This is the starting point of generalised linear models: a whole family of models you can get on top of the linear architecture, and that is the basis. Neural networks are an extension of this, where you move from the linear to the nonlinear by adding this layer-wise structure that contains nonlinearities. That is where most of the magic happens, and most of the research is in defining these types of layers, what they do to the space, and so on. We will get to that in a bit. But for now we stay with the linear case and these observations, going very step-by-step, at a very basic level.
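For reference, here is the gradient-based derivation just sketched, written out cleanly (my notation; X is the N by D data matrix with the bias folded in):

$$
L(W) = \lVert XW - y\rVert^2,\qquad
\nabla_W L = 2\,X^\top (XW - y) = 0
\;\Rightarrow\; X^\top X\, W = X^\top y
\;\Rightarrow\; W^{*} = (X^\top X)^{-1} X^\top y .
$$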
It may be a bit boring, but: you have x, a feature vector, with entries x1 through xD representing different features. The example, I think it is on the next slide, is something like movie review classification: you have this detector-style setup where the features record things like which words occur in the review, and you use them to predict the sentiment, whether it is a positive or a negative review. So your output is 0 or 1, for negative or positive, and your features are properties of the review you are looking at. For example: does the review contain the word "awesome", yes or no, zero or one. Those are your features, and you want to say whether the review is positive or negative based on them. The way you do that is you assign a weight to each word, depending on whether and how it appears in the text, and based on that you make a prediction. So that is the problem statement; you can also have multiple classes, it does not have to be binary, but binary is the usual starting point. The way you do this is: you take a weighted sum of the features (the features are zero or one depending on whether a particular word occurs), you can add a bias at the end, and you get this value z. If z is high (you would use negative weights for negative words and positive weights for positive ones), you would say this is positive sentiment; if it is low, negative. But you would like to turn this into a classification inside a probabilistic framework, because that is a reasonable thing to want and it lets you do other kinds of interesting things. So what you do is take your z and push it through a function that normalises things to be between zero and one. Once you have done that, if you do it properly, what you get out is a distribution. The function you use is the sigmoid: sigma(z) = 1 / (1 + e^(-z)). The function looks like this: if z is very positive it goes to one, if z is very negative it goes to zero, and it has this particular S shape. So why is this now a probability? Because it is between zero and one and it has a very nice property. The idea is: you do what we did before, but you pass the result through this sigmoid. And there is a nice fact here: it turns out that P(y = 1) equals 1 minus P(y = 0), which follows directly from the form of the sigmoid. There is a little bit of math on the slide whose only purpose is to show you that these two are one minus the other, which, together with the fact that they sum to one and that both values are positive, means they respect exactly the properties you are looking for in a distribution over two classes. So now, if you just need to make a hard prediction, you can compare this value to 0.5. And what you get out is more than a decision: when you look at the output, you get the probability of the sentiment of the review being positive, and one minus that is the probability of it being negative.
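A tiny sketch of the sentiment example, with made-up words and hand-picked weights just to show the pipeline: indicator features, a weighted sum z, and the sigmoid turning z into P(y = positive | x).

```python
import numpy as np

def sigmoid(z):
    # The logistic function 1 / (1 + e^{-z}).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical indicator features: does the review contain each word?
vocab   = ["awesome", "great", "boring", "awful"]
weights = np.array([2.0, 1.5, -1.5, -2.5])   # positive words get positive weights
bias    = 0.1

review = "the plot was boring but the acting was awesome"
x = np.array([1.0 if w in review.split() else 0.0 for w in vocab])

z = weights @ x + bias
p_positive = sigmoid(z)
print(p_positive, 1.0 - p_positive)   # P(y=1|x) and P(y=0|x) sum to one
print("positive" if p_positive > 0.5 else "negative")
```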
So this is basically the conversion from a linear model, in principle, to something that has a probabilistic interpretation in the classification scenario, here binary classification. By using the logistic function you are renormalising your outputs in such a way that you can now use them as probabilities, and your decision boundary is going to be at 0.5 if you need a hard call. And there are a lot of nice properties of this particular function: for instance, this 0.5 point is also an unstable point, so as the model learns it prefers to move towards the extremes rather than staying there, which is usually something you want.

Student: I have seen a few times that, in order to improve some models' predictions, people tune this decision boundary. Isn't that just masking the fact that the classifier is bad class by class, even though it does produce better results?

Sorry, can you repeat the first part? ... Tuning the decision boundary, and whether it hides a per-class problem even though it improves results. You mean things like moving the threshold away from 0.5 to favour one class, because of some property of your data? Oh, I see: people choosing where to put the boundary instead of using 0.5. Generally this happens a lot when you have imbalanced data, I think; that is the typical scenario where you play with this. I have not played in that scenario as much as I perhaps should have, so my knowledge here is a bit shaky, but I would imagine that just resampling the data, or doing other kinds of tricks to the data to correct for the imbalance, would work equally well while leaving the decision boundary where it is. It might also be that these things are actually more or less equivalent; I have not fully worked it out in my mind at the moment, but I suspect that whether you move the decision boundary or you rebalance the data, they end up doing roughly the same thing. I would trust fixing the data more than playing with the boundary, particularly if you use data augmentation, which adds something extra, rather than just subsampling the more popular class and shrinking your dataset, which is not ideal because it makes learning harder. If you have a way of upsampling by adding augmentations, I would believe that is the better way of dealing with it. Playing with the decision boundary, I agree with you, feels a bit dodgy, but I think in a technical sense it may come out very close to the resampling, particularly for logistic regression.

Student: but isn't the 0.5 itself coming from assuming something like a uniform distribution over the classes?

Sorry, what do you mean by uniform here?
Student: that the data is distributed 50/50 between the classes, so 0.5 is the natural threshold.

Yes, exactly. That is the assumption baked into all of this. You assume that your dataset is balanced, which means that a priori, without any information about a data point, you assume it is equally likely to be of any of the classes. That is the standard assumption that is made. Then you have the imbalanced scenario, which is usually treated as its own sub-area, and there you have all kinds of techniques; but most of the techniques in the imbalanced case, when there is some prior on the distribution of the classes, are really about making the dataset balanced. Most of them amount to: resample things in such a way that you have an equal number of examples for all classes, which is exactly what we were just discussing.

Student: and then at deployment, I would expect the class proportions I see to match what I assumed during training?

Yes. You expect that if you do not know anything about a data point, the probability of it being of any given class should reflect what you will actually see. There are other things people do, and maybe this relates back to the first question. One thing you can do is train the model and then play with the decision boundary afterwards, moving it around. The other thing is that you leave the decision boundary where it is and you learn the weights so as to push the model where you want it. For example, one thing people like to do is max-margin style learning, where what you are doing is pushing the weights during learning in such a way that the decision boundary you end up with sits as far apart from both classes as possible; if you imagine the classes as coming from two regions, you want to cut exactly in the middle between those regions. You could achieve something similar by playing with where you put the boundary, but usually people do not. In machine learning, at least the parts of it I am aware of, you usually do not touch the threshold; you only touch W and b. You tweak your learning, you add regularisation terms on W and b, such that the learning itself discovers weights that put the decision boundary where you want it, without you touching the 0.5. And a lot of the time you do not even need to make a hard decision; in many settings the desired output is just the probabilities. It is not really a decision at all; you hand the probabilities to whatever consumes them downstream. OK, I hope this is clear enough. So now the interesting part: how do you actually train this thing?

Student: can't you just apply least squares here as well?

You can, but that is the wrong thing to do in some sense, because it does not take into account the fact that you now have this probabilistic interpretation.
It treats the problem as if it were a plain squared-error loss, and there are real issues with using the squared loss here. You can get a feeling for it by playing with it: the right loss will turn out to be the negative log-likelihood. And you can also just see empirically what happens if you take a system, even something like a softmax classifier, train it with the squared loss, and try to make it work: it will train, but it performs worse. There are deeper reasons for this that hopefully we will get into, or you can ask me about later. In machine learning there is this concept of a matched loss function for a given output or activation function. The way it usually goes is this: depending on your choice of output, whether it is a linear output, a sigmoid, a softmax, or some other activation, there is usually a probabilistic interpretation attached. If it is a linear output, you assume the output of your model is Gaussian distributed; if it is a sigmoid, it is a Bernoulli; and so on. And because there is a probabilistic interpretation, there is a loss function that is matched to that interpretation. Hopefully we will get there in detail, but the whole idea is that the matched loss for a particular probabilistic interpretation comes from using the KL divergence as the measure of distance between distributions. You can show that if you take the KL as your divergence and work things out, you basically recover the familiar losses. It is simply that, having made this probabilistic interpretation, you now have to respect it and compare distributions rather than raw numbers; the natural way to do that is the KL divergence, and given the distributional assumption you made, whether it is a Gaussian or a binomial, when you work out what the KL means you get a simplified expression, which is the matched loss for that output. In a more probabilistic framing, that is how you would approach the problem. The way I originally learned this stuff was as a set of recipes: it is always squared error for regression, always softmax with cross-entropy for multi-class classification, sigmoid with binary cross-entropy for binary classification, and you just memorise which loss goes with which output; if you use the matched pairs, things are nice. At least that is how I learned it, and I only figured out the reason much later by reading books: there is a reason why you pick this loss for that activation function rather than combining whatever you want. OK, so let me continue. The usual way you derive the cross-entropy (I did not write these slides myself) is, if you want to get there from first principles, to go back and ask: what am I really doing here? I have this probabilistic interpretation, so I know that the sigmoid is giving me the probability of y being a certain class given my observation x. So what I want to do is look at the likelihood of the true labels.
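A compact version of the "matched loss" point, as I understand it: take the KL divergence between the target distribution p (a point mass on the true label y) and the model's Bernoulli q_theta; minimising it over theta is exactly minimising the negative log-likelihood, which for the sigmoid output is the binary cross-entropy.

$$
\mathrm{KL}\!\left(p \,\Vert\, q_\theta\right)
= \sum_{c\in\{0,1\}} p(c)\log\frac{p(c)}{q_\theta(c)}
= -\log q_\theta(y) + \text{const},
\qquad
-\log q_\theta(y) = -\,y\log \hat{y} - (1-y)\log(1-\hat{y}).
$$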
So you can write this per-example likelihood as the product y-hat to the power y times (1 minus y-hat) to the power (1 minus y), and you can easily see, because y is zero or one, that one of the factors is always one, so the product simplifies to something that makes sense. If the true label is one, the value of this is just y-hat, which is exactly what the model outputs; I do not know if you remember, but that was the convention we had: the model outputs the probability of the class being one, and the probability of the class being zero is one minus that. So if y is one, the second factor disappears and all you are left with is y-hat, the probability of the class being one. If the true label is zero, the first factor disappears, you only get the second term, and that second term is just one minus y-hat, which is the probability of the class being zero. So this product is really the probability of the model agreeing with the label; I guess here it should have been written P(y-hat = y | x). This is the quantity we want to maximise, because you want the training points in your dataset to have high probability under your model: you want your model to be very confident that each training point belongs to its correct class. So this hopefully makes sense. So you maximise this. Then, again as a convention that makes things easier and numerically more stable, you take the log of it, because products of probabilities become sums and everything behaves much better; that gives you the alternative formula here. And then you additionally sum this over all the training points that you have. Now, this is stated as a maximisation problem, but what people prefer in machine learning, again by convention (I think I even have it the opposite way on one slide), is to minimise things: in machine learning you always minimise a loss. Here the natural direction is to maximise, because we want to maximise the probability of the correct class given the observation; so to turn it back into a minimisation problem we just add a minus sign in front. Now we minimise it, and this turns out to be exactly the cross-entropy. And that is the reasoning for why cross-entropy is the right objective: what cross-entropy is doing is maximising the probability of the correct label over your training set, and these steps are the step-by-step unpacking of that statement, with the minus sign turning it into a loss. (On the slide it is written going back the other direction, and the signs can be confusing if the terms get moved around, but the content is the same.) I also wanted to mention that you can arrive at the same thing in several different ways; there are multiple derivations. I think this is why people really like to frame things from a probabilistic point of view: once you are in that framing, there is a lot of machinery you can reuse, and over time it becomes a very flexible way of deriving objectives, which is what you want. So I have some slides here where you rewrite the same thing, but starting from Bayes' rule.
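The same derivation in code, as a sketch: the per-example likelihood y_hat^y * (1 - y_hat)^(1 - y), its log, the sum over the training set, and the minus sign that turns maximisation into the familiar binary cross-entropy loss. The toy numbers are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(w, X, y, eps=1e-12):
    """Negative log-likelihood of a logistic-regression model on (X, y)."""
    y_hat = sigmoid(X @ w)
    # log of y_hat^y * (1 - y_hat)^(1 - y), summed over the training set
    log_lik = y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps)
    return -np.sum(log_lik)      # minus sign: maximise likelihood == minimise this

# Toy check with made-up numbers.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
print(binary_cross_entropy(np.array([2.0, -2.0]), X, y))
```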
Here, instead of saying, as before, that we want to maximise the probability of the true labels over the training set, you can say that a more generic way to derive your objective is: I want to find the probability of the correct parameters given my data, and I will use Bayes' rule to relate that to quantities I can actually compute, and from there get back to the cross-entropy loss. So you say: I have some data, and what I am looking for is the probability of theta given this data; the theta-star I want is the most likely value under that distribution, the mode of the posterior, so to speak. So the reframing here is this: instead of assuming there exists a single theta-star that solves my problem, we assume there is a distribution over theta, and that distribution tells me how likely each theta is to be the optimal parameters given the observations I have. And the point is that, in practice, there might be multiple theta-stars that would be equally likely to be optimal given the data I have. So what I want to learn is this object: the distribution over theta given the data, that is, how likely each theta is to be the right set of parameters. And the way I get at it is that I use Bayes' rule and I turn the problem around. Bayes' rule says that the probability of theta given the data is the probability of the data given theta, times the prior probability of theta, divided by the probability of the data. I can use Bayes' rule to invert things because I do not know P(theta | data), but I do know P(data | theta): that I can simply evaluate. I have my model, so I can push the inputs through it, look at the outputs, and see how likely the true labels are under the model. So that is something I can compute, whereas the posterior is the thing I actually want. The whole point here is that another way of deriving this entire objective is to start from Bayes' rule, and what you get at the end is the same objective as before; you end up with the same kind of expression. You do the inversion, and then, again because you want to minimise rather than maximise, you put the minus in front, you take the log, and you get the negative log-likelihood, which is what we had before as well. There is one difference, though, and I do not remember if I left it on the slide: if you start from the Bayesian perspective, because of Bayes' rule you get an additional term. From Bayes' rule you also have the prior probability over your parameters; before, we were simply ignoring it. To repeat: from Bayes' rule, the probability of your parameters given the data equals the probability of the data given the parameters, times the probability of the parameters without knowing anything about the data, divided by the probability of the data.
So this term over the parameters is your prior. It is your belief about which parameter values are likely to be the right ones, before you have seen any data. You can have a uniform prior, saying "anything goes", no assumption at all; or you can have an informative prior, for example saying "I believe my parameters should be small in norm", which is actually a very typical thing people do, partly just because we like small numbers. But note that this has nothing to do with the data: it is purely your general belief about what reasonable values for your parameters look like. If you want to be a proper Bayesian about it, you should always have a prior; you should always have a belief, even before seeing any data, about what is likely and what is unlikely. The thing is, if you start from Bayes' rule and derive your objective this way, which is the more principled way, so to say, then when you apply the log you get an extra term, which is the log of the prior probability of your parameters. (To jump ahead: this term comes from Bayes' rule, and it carries a minus sign because you put the minus in front of the whole expression when you switch from maximising to minimising.) And this term turns out to be your regulariser. The role of the regulariser is to say: you can have multiple thetas that all seem to fit your data; among those, which would you rather have? If you have multiple values of theta that give you the same error, and you use the small-norm prior, you are basically saying: give me the one with the smallest magnitude among all the thetas that achieve that error. It is a way of ranking the possible solutions you have. (Obviously it does not work exactly that cleanly in practice, but that is the idea.) Then you have this last term, which is just the probability of the data, and this usually disappears, because it does not depend on your parameters; it is purely a property of the data. So once you write the objective and take the derivative with respect to theta, the derivative of this term is zero. It is usually ignored, and you can ignore it, because you have no control over the data: the data is the data, it is given to you, and you are not choosing it. The only things you are controlling are the model and, in some sense, the prior, and even the prior is not a data-dependent object. OK, let me check the next slide. Right, so this is priors and regularisation, that is the connection. I think we can take our break here and continue next time. Let me think about how to manage the time: my sense is that many of you are quite comfortable with this basic material, so I might compress the rest of it and move to the neural network part next session, because I feel that would be more interesting and a better use of our time. I added this section because I did not know exactly where the level would be, but I do think it is a nice viewpoint to have.
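A sketch of how the MAP view changes the objective: with a Gaussian (small-norm) prior on the weights, the extra minus-log-prior term becomes an L2 penalty, so the loss is just the negative log-likelihood plus lambda times the squared norm of the weights; a uniform prior drops the term and you are back to plain maximum likelihood. The lambda below stands in for the prior's scale and is a made-up value.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y, eps=1e-12):
    y_hat = sigmoid(X @ w)
    return -np.sum(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def map_objective(w, X, y, lam=0.1):
    # -log P(D | w)  +  -log P(w)  with a Gaussian prior  =>  NLL + L2 regulariser.
    # (The -log P(D) term is constant in w, so it is dropped.)
    return nll(w, X, y) + lam * np.sum(w ** 2)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
print(map_objective(np.array([2.0, -2.0]), X, y))
```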
I might keep the point about the KL divergence, because it is quite useful, maybe a slide or two, and then head into the neural network material afterwards. And then hopefully at some point everyone will start speaking up and it will become a lot more interactive, which is what I want.
LECTURE 2:
These are two standard scenarios that people use to show, for example, what a linear model cannot do. One is that it cannot separate the two moons, which is obvious: there is no way I can rotate this line and move it around so that it separates them; the boundary is not linear, so there is no way of getting there with a line. And maybe the other one is even clearer: the concentric circles. It is the same reason, the classes are simply not linearly separable. You can fix this, however, if you project your data. Here, with the circles, there is no separating line in the original space; it just does not exist. But if I project all my data points to just their radius, that is, how far each data point is from the centre, with the centre sitting at zero, and then I look at the data after this projection, things become really simple and everything is fine, because the two classes now live at different radii and a simple threshold separates them.

Student: does this trick have a name?

I just think of it as an intuitive example, so I would not be surprised if it does. I took it from somewhere else where they were trying to illustrate the same point; I wanted a projection that I can easily write down mathematically and that is very visual, so I went with the radius. I would not be surprised if there is a particular name for it, but basically you are just representing the data differently. I think computer vision in the 90s, maybe early 2000s, was all about this. People would come up with all kinds of projections, much more interesting ones than the radius, all kinds of hand-designed mathematical functions of the input; they would call them features, and this was feature engineering. You would take an image, compute this battery of features, and then put your linear model on top of it. That is what computer vision used to be, and SIFT features and the other kinds of features from that era are exactly something like this. And the key point is that if you take your input and push it through a projection, it has to be a nonlinear projection. Why is that important? What happens if the projection is linear?

Student: it doesn't change anything, I think.

Exactly. If the projection is linear, you are composing two linear things, and linear composed with linear is linear. So if that helped, it would mean there had been a linear solution to start with; a linear projection does not change the class of problems you can solve. So it has to be nonlinear. But if you have a good nonlinear projection, then you can put your generalised linear model on top, your linear regression, your logistic regression, whatnot, and things work well. And this is what people did for a long while. But the realisation, or simply the fact, is that picking this transformation correctly is hard, and sometimes it amounts to solving 90% of your task: coming up with the projection is basically solving the problem by hand, and there is not much left to learn afterwards, in some sense. So a big question in the field, and this is what was going around at the time, is: can we learn the projection itself from the data? Because we do not know a priori what the right projection is.
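A small sketch of the circles example, with synthetic data I generate myself: in the raw (x1, x2) coordinates no line separates the two rings, but projecting each point to its radius makes a single threshold (a "line" in the 1-D projected space) enough.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """Sample n noisy points on a circle of the given radius."""
    angles = rng.uniform(0, 2 * np.pi, n)
    noise = 0.05 * rng.normal(size=(n, 2))
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1) + noise

inner, outer = ring(1.0, 200), ring(2.0, 200)
X = np.vstack([inner, outer])
y = np.array([0] * 200 + [1] * 200)

# Nonlinear projection: distance of each point from the centre.
r = np.linalg.norm(X, axis=1)

# In the projected space a single threshold separates the classes.
threshold = 1.5
pred = (r > threshold).astype(int)
print("accuracy after radius projection:", (pred == y).mean())
```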
And that is what neural methods are trying to do. To do this, you need to pick a way of formalising what the structure of the projection is, so that it is not arbitrary; you need to give it a structure. And that structure is this layered architecture of neurons, which is inspired by biological models. The field actually started, more or less, in neuroscience, with people trying to understand how the brain works; it did not start with people asking "how do we learn a projection that makes things linearly separable". They were trying to understand biological neurons. Biological neurons (I am not a biologist or a neuroscientist, but I think everyone has seen this) have this kind of shape: they have an axon and they have dendrites, and they are connected to each other. The picture people arrived at is that the dendrites collect inputs from neighbouring neurons, so there is a weighted sum; these inputs are combined together in the cell body; and, this is the part I will come back to, there is something more, namely a nonlinear transformation that happens within the neuron; and then the result is sent forward to other neurons. So that is the structure of a single neuron, and I will return to it. Furthermore, the brain itself: here I am describing a very traditional view of the brain, of the visual pathway in particular. I think there are real questions about this view, because the brain is layered, but it is also highly recurrent, with feedback connections throughout, so it is more complicated than this feed-forward story. But there used to be (I do not know if people still teach it this way) this storybook view of the visual pathway where you would say: the information goes from the eye to V1, from V1 to V2, from V2 to V4 (V3 gets skipped in this story), and from V4 it goes to IT, and things happen from there onwards. So there was this hierarchy of stages in how vision was processed, and this is roughly what people would argue happens: in V1 you do edges, line detection; then you start looking for shapes; in V4 you start doing objects; and in IT you have faces. So there was a layered hierarchy in how the pathway works, with semantic differences between the layers: they were composing, building up more and more complex features. And this is what neural networks are trying to mimic. These are the intuitions, from roughly the 50s, 60s, 70s, about how the brain works, and this is what the neural networks of the day were trying to imitate. You had, for instance, what was it called, the neocognitron, thank you, exactly: that was basically inspired by this. OK, so this is the architecture, and the nonlinearity must be in there. Why it must not be linear we already talked about.
If you do not have the nonlinearity, then when you stack these layers together it all collapses, because the composition of linear maps is linear, so nothing is gained. A key point is to have this nonlinearity after each linear step in the sequence. So this is the inspiration, and this is what neural networks are trying to do: they learn the projection by stacking these steps. Of course, you have a bunch of choices here: how you do the linear part, how you handle the weights, how you do the nonlinearity; each choice matters. The way people started was to use the simplest possible linear map for the projection, which is what we do nowadays as well; we call the result an MLP. And as activation functions they used to use sigmoids a lot. The main reason the sigmoid was the starting point for playing with this is that the sigmoid has a probabilistic interpretation: the way they would see it is that it gives the probability of whether this neuron should fire or not. And the point was that biological neurons work in impulses; they are not continuous, they have these spike trains. In the artificial version you do not really know how to train with spikes, so you want continuous values, and you replace the spiking with a firing probability. There is a lot of interesting history there, but obviously these choices changed over time. So now we are going to take a slightly more mathematical perspective to get a sense of what is going on. In particular, and this is not necessarily the order in which things happened historically, one activation function that is very widely used right now is the ReLU, instead of the sigmoid. We will talk later about why people ended up switching to ReLU, what was used before, and what drove the change. But the ReLU is a very simple function: max(0, z). And the reason I am starting with the ReLU, and will come back to the sigmoid afterwards, is that the ReLU is a bit easier to understand mathematically; it lets us reason geometrically. OK. So, if we have a single layer of a neural network following this recipe (as I said, the visual pathway had three or four stages, three or four layers, but for now we start with a single layer because it is the easiest to understand), what happens is: you have your input, you do a linear projection, written here as Wx + b; you apply the ReLU, and the ReLU is applied individually to each neuron, because that is how the neuroscience picture goes, the nonlinearity happens within each cell body; and then you have your linear model on top. So what is nice about this architecture? Well, first of all, you can now solve problems that were unsolvable before. In particular, this model has a piecewise-linear decision boundary, and the number of linear pieces grows with the number of hidden units, so you can draw the kind of decision boundary that separates the two moons. Maybe I will stop on this for a moment: is it intuitive why this is a piecewise-linear function?
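The single-hidden-layer architecture just described, as a minimal forward pass (shapes and names are mine): a linear projection, an element-wise ReLU inside each "neuron", and a linear (here logistic) model on top.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_hidden_layer(x, W1, b1, w2, b2):
    """x -> ReLU(W1 x + b1) -> linear readout -> sigmoid."""
    h = relu(W1 @ x + b1)        # nonlinearity applied individually to each unit
    return sigmoid(w2 @ h + b2)  # linear (logistic) model on top of the projection

# Toy dimensions: 2-D input, 8 hidden units, random weights just to run it.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
w2, b2 = rng.normal(size=8), 0.0
print(one_hidden_layer(np.array([0.5, -1.0]), W1, b1, w2, b2))
```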
And for this particular problem, intuitively, would you even know how to set up the weights of your model by hand to get a decision boundary that looks like this? That is when you really have intuitions about how these things actually work. You can, and in the coming slides we are going to get to it, but I just want to make sure we build this intuition, even if it is obvious to some of you. So: you can imagine, and we will keep coming back to this perspective, that there is an in-domain region and an out-of-domain region. We do not really care about the behaviour of the model on the entire real axis; we care about its behaviour in the region where we have data. That is how we start thinking about it. And then you think about the hidden units one at a time. So you can say: a starting point for the function is this straight line, maybe where the purely linear part lives, up to some point. I can take my first hidden unit and say: it is zero for negative inputs, and I do not care what happens out there; then, from this point onward, from zero onward, it is a straight line with some slope. Fine. Then I ask what the second hidden unit does. Using the bias and the linear projection for my second hidden unit, I can make it so that this unit is zero up to the point where I want the next kink in the function, the next joint between pieces; after that point it becomes active. And once it is active it is a linear function, and I just need to choose it so that, added to the previous one, it changes the slope to whatever I want. That is something you can compute; I am only giving the high-level intuition here, but you know that if you add two linear functions you get a linear function and you know what its slope is. So you just say: after this point, the second hidden unit contributes exactly the delta needed to change the slope to the value I want for that piece. Then at the next junction the third hidden unit becomes active, and again you can change the slope to anything you like; then the fourth, and so on. So if I have a one-dimensional problem and I want to construct a particular solution by hand, I can do it this way: I order my hidden units, get them to activate at different points on the real line, and those activation points are where the joints between two linear pieces of my piecewise-linear function sit. Every time another hidden unit becomes active, I can change the slope however I want. (I cannot go backwards, because it has to be a function, but I can change the slope to anything.) So I can construct essentially any piecewise-linear function, at least in one dimension; you can do it in higher dimensions too, but it is harder to visualise.

Student: so you mean that every neuron partitions the space into two regions, is that what you mean?

Yes. And that is what piecewise linear means.
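Here is the 1-D hand construction as code, with my own made-up knots and slopes: each hidden unit is ReLU(x - c_i), so it "turns on" at c_i, and its weight a_i is exactly the change in slope at that knot.

```python
import numpy as np

def piecewise_linear(x, knots, slope_deltas):
    """Sum_i a_i * ReLU(x - c_i): each unit activates at its knot and changes the slope by a_i."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for c, a in zip(knots, slope_deltas):
        out += a * np.maximum(0.0, x - c)
    return out

knots        = [0.0, 1.0, 2.0]     # where each hidden unit becomes active
slope_deltas = [1.0, -3.0, 2.0]    # slope changes: 0 -> 1 -> -2 -> 0

xs = np.array([-1.0, 0.5, 1.5, 3.0])
print(piecewise_linear(xs, knots, slope_deltas))
# Slopes per segment: 0 on (-inf, 0), 1 on (0, 1), -2 on (1, 2), 0 on (2, inf).
```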
Yes, piecewise linear here means that locally the function is linear; it is built as a sum of linear pieces. And the way this happens is that each neuron splits the space into two, and it changes the linear behaviour on one side of that split.

Student: is that why we say we can represent any continuous function with a single layer?

Yes, exactly. To be clear, this is not a proof; there are proper proofs you can read, and they are more technical. But this is the intuition behind the universal approximation theorem, which is exactly what you are asking about. The universal approximation results for neural networks are from around 1989, I believe, and the original versions were stated for sigmoid units, because that is what people used back in the day. Basically what they say is that, with enough hidden units, in the limit infinitely many, you can approximate arbitrarily well essentially any reasonable function. And in the ReLU case, in one dimension, you can see how that can happen: if you allow an unbounded number of linear pieces, you can place them wherever you want and follow any target function as closely as you like. So that is the universal approximation theorem, and it was a big deal: when these results came out they got a lot of people excited and working on neural networks, because people felt, "oh, so this is the right model". But it actually says a lot less than you would think; it is really not very informative as a theorem. Because it is an argument purely about expressivity, and it is in the limit, so it is not something you can ever have in practice; and it says nothing about learnability. Just because the family can express these functions does not mean I can learn them with any reasonable amount of data or compute. In particular, other function families that we do not use have the same property. Polynomials, for example: you can approximate whatever you want with polynomials, and yet we do not use polynomials for learning on real data, because in practice they do not work well. So the theorem is useful, but at the same time it is not the reason we use neural networks. Historically it did matter, though: it came on the back of the previous AI winter, where you had results like the XOR problem, where people showed that certain architectures cannot express certain things; here people could say, you are not going to find another counterexample like that, because this family can express everything.

Student: going back to the earlier point: you said there is a relationship between the number of linear pieces of the boundary and the number of hidden units within a single layer. How does that translate if instead we stack two layers?

Exactly, that is what the next slides are about.

Student: do you get exponentially more linear pieces the deeper you make the model, so it should be more efficient?

It should be more efficient, yes. I am going to go through the details; this is exactly what these slides show. And this is one of the arguments people make for depth: if you look at the papers, even now, it is all about having very deep models, 20 layers, 100 layers and so forth.
Actually, when I said earlier that there was this rebranding from neural networks to deep learning, this was basically the boundary of the rebrand: "deep learning" meant a neural method that had at least two layers or more, while "neural networks" had come to mean the shallow ones. That is how they rebranded and pushed the field. Anyway, the distinction matters here. So one big question is: why, what do you actually get? If you have universal approximation with a single layer, what is the importance of depth? What does it provide? And the answer is basically that depth is, in a certain sense, more efficient; and later it will turn out that it helps optimisation as well, but we will get to that. Before that, though: how do you even argue about this? The way you typically argue, and I am already giving away part of the answer, is that you find a metric that measures the expressivity, the flexibility, of your model, and the natural one for a ReLU architecture is the number of linear pieces, the number of linear regions (or number of breakpoints, or whatever the standard term is; feel free to correct me if I am misusing the terminology). So, if you start with a single layer, you can actually compute the maximum number of linear regions you can get. And this turns out to be a well-known result in mathematics, so maybe some of you know it: this is Zaslavsky's theorem, from the 70s. Does anyone know it, or how it works? OK. So Zaslavsky, and this had nothing to do with machine learning, was trying to understand the following: if you have N hyperplanes in a space, into how many regions can they split that space? That is the question he was trying to answer; it is about hyperplane arrangements, line arrangements, plane arrangements and so on. But then it turns out, as someone pointed out in the previous session, that you can think of a hidden unit as splitting the space into two: the region where the hidden unit is zero and the region where it is active. So you can think of each hidden unit as a line, or a hyperplane if you are in a higher-dimensional space. And then the question Zaslavsky was asking is exactly our question: if each hidden unit is a hyperplane, into how many regions do N of them split the space? I am going to go through it, because I think it is interesting and it is directly related to what happens in the deep case, so it is good to know. The proof is actually quite simple. I will not go through all of it, just the high-level idea. It works by induction, and it relies on one simple fact. Let us do it in the plane, in two dimensions; in higher dimensions it gets messy, but you can look it up afterwards if you are into this. In the plane, all you need is the fact that two lines can intersect in at most one point. That is it: two distinct, non-parallel lines meet in exactly one point. So you start with a single line: it can only split the plane into two regions, that is all it can do. Now you take the second unit, so you add a new line.
You had only one line before; when you add a new line, that new line can intersect the existing line in at most one point. So you can count how many new regions it creates, and it creates two new regions. Then you go to the third unit and say: I have two lines, I add a new line; this line can intersect each of the other two in at most one point, so it has at most two intersection points, which cut it into three segments, and each segment adds one new region. At each step you count how many regions get added, and you always assume the new line intersects all the previous ones in distinct points, because that is how you maximise the number of regions you create each time you add a line to the plane. That is how the construction works, and actually carrying out the proof is quite easy once you know the idea; you just need to do the counting. Does that make sense? Is everyone OK with that construction? OK. So this is very nice because, first of all, it tells you how to achieve this maximum partitioning of the space, and it gives you an exact number. So this is the limit of what you can do with a single hidden layer, with a ReLU model of one layer. So now the question is: what happens if you go deep? And first: what is the crucial limiting factor in the number of regions we get in Zaslavsky's setting? The limiting factor is exactly the fact that two lines cannot intersect in more than one point: the number of intersection points is limited, and you get a fixed number of new regions per intersection point. So the growth of the whole thing depends on how many intersection points you can create. So what changes when you move to a deep architecture? (Sorry, how many of you is this making sense to?) It is a little subtle: if you just look at the architecture, it looks opaque. But one thing you can do is look at a single hidden unit on the second layer. I am counting layers from the bottom up, so the second layer is the one after the first hidden layer. What is the function that a hidden unit on the second layer computes? What does it look like mathematically?

Student: it's a function of the first layer.

Yes, but what kind of function? A composed function, sure, but concretely: if I look at this unit before its activation function and write it down, it is exactly a one-hidden-layer MLP, a linear readout of the first layer's ReLU outputs. That is what a hidden unit on the second layer is. So then, if I ask how it carves up the input space, if I ask the question "when is this unit equal to zero?", if I want to draw the set where it equals zero, what is that going to look like? It is just an MLP. What do we call it when you take the output of an MLP, threshold it, say at 0.5, and draw that set?

Student: that's the decision boundary, for binary classification.

Exactly, that is a decision boundary. So the zero set of a second-layer unit looks exactly like the decision boundary of a single one-hidden-layer MLP, and we said that boundary is piecewise linear.
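For completeness, the counting result itself: the maximum number of regions that N hyperplanes can cut a D-dimensional space into is the sum over i from 0 to D of C(N, i). (This is the standard statement of the region-counting result as I remember it; in 2-D it reproduces the 1, 2, 4, 7, 11, ... sequence from the induction above.)

```python
from math import comb

def max_regions(n_hyperplanes: int, dim: int) -> int:
    """Maximum number of regions N hyperplanes can create in D-dimensional space."""
    return sum(comb(n_hyperplanes, i) for i in range(dim + 1))

# In the plane: 0 lines -> 1 region, 1 -> 2, 2 -> 4, 3 -> 7, 4 -> 11, ...
print([max_regions(n, 2) for n in range(5)])
# A single hidden layer of 20 ReLU units on 2-D inputs: at most this many linear regions.
print(max_regions(20, 2))   # 211
```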
So if I take a unit from the second layer, with just the one hidden layer beneath it, and I ask when this unit outputs some constant (it does not matter which constant: zero, 0.5, whatever), that set is essentially a decision boundary, and it is a piecewise-linear set in the input space. Does that make sense, or have I lost many of you?

Student: can you say it again? Can you elaborate?

OK. Suppose we have a single-hidden-layer MLP with a single output, and we want to draw the decision boundary of this MLP. The decision boundary (it is an overloaded term, but we use it for whatever thresholded prediction you make) is asking the following question; let me write it on the board, that will make it clearer. So I have h1 = w2 transpose sigma(W1 x), where sigma is the ReLU (I will write sigma because it is easier than writing ReLU every time), and I am going to skip the biases because they are not important here; as I said, we can always include biases, but the derivation looks simpler without them. The decision boundary of this model is asking: where is h1 equal to some constant? That is the problem you are solving when you draw the decision boundary. For a linear model, when I ask where W transpose x equals a constant, the answer is a line, and that line is the decision boundary (in the probabilistic version, the constant is whatever maps to probability 0.5 after the sigmoid). For the one-hidden-layer model, we can use the construction we talked about before, looking at the hidden units one at a time. Call the pre-activations z: you have, say, K units, z1 through zK. Each of them splits the space with a line, because each unit zi is itself a linear model (with no bias here, so its line passes through the origin): if I look at where zi = 0, which is exactly where the ReLU switches behaviour, I get a line. So each of these units is a line and splits the space into two parts. Now, when I linearly combine these ReLU outputs, I get my piecewise-linear function, by the same trick as before: in the region where z2 through zK are all zero and only z1 is active, the function is linear; then at some point z2 becomes active as well and the slope changes; and so on. So when I ask where h1 equals a constant, what I get is a piecewise-linear set. That is the kind of function you can expect from a single layer of ReLUs. Now, if I go to a second layer: let us say h2 = w3 transpose sigma(W2 sigma(W1 x)); sorry if my notation on the board does not exactly match the slides. This is just repeating the same thing one level up. The inner chunk here is just h1, or rather a whole vector of h1-like functions; let me write them with an index: h1^(1), h1^(2), and so on, say ten of them. So I have ten hidden units on my second layer, and each of them, before its ReLU, computes an h1-shaped function.
Each of those units is a different function that looks like h1. So each of these h1's is a piecewise linear function, because each of them is exactly that expression, right? Does that make sense? Okay, so then what is the process we're doing with W3 and the ReLU on top? The ReLU splits the space again, wherever its input is negative versus positive. But the way it does that, it now splits the space with a piecewise linear function. So, for example, this blue line here (sorry, my slide notation doesn't match what I drew on the board), this blue line here is the piecewise linear function that corresponds to one unit on the second layer, and it splits the space into two regions, the one on the left and the one on the right. So if we go back and try to answer the same question we were answering before, it's a very similar machinery: we can do the induction again, start by assuming we have a single unit on the second layer and then keep adding them one by one. The only difference is that now, instead of lines, we have piecewise linear curves. So how does that change things? What are the properties we expect compared with the theory before? [Student: if I understand correctly, we are splitting the already-split space, so between the K pieces of the first layer and the K units of the second you get something like a product of the two?] Not quite, but you are almost on point; that's roughly what the result will be. So, just to state it a bit more concretely: what we cared about before was in how many points two lines can intersect, and it was only one. Now the question is, in how many points can two piecewise linear functions intersect? And the answer is: many more, as many as you want. If you have the freedom to define the pieces the way you want, you get a lot more intersections. And the rest of the argument is the same: every time you intersect, you control how many regions you get to add. So the trick now is you want these piecewise linear functions to intersect in as many points as you can, and the more they intersect, the more linear pieces you get out of it (a tiny numerical illustration of this follows). [Student: but the number of intersections is bounded by the number of hidden units in the first layer, no?] Yes, it is. So the question now, and this is the machinery that's being used: by choosing the number of units in the layer below, can you build a construction that leads to more intersections than you can get from a shallow model? There is a lot of freedom in what counts as a fair comparison, so one simple approach to make things easy is to say that what you control is the total number of units: the shallow model and the deep model get the same budget, just split across layers. So if the shallow one has twenty hidden units, the deep one will have ten and ten. And the question is: can I build a construction with ten and ten that leads to more linear regions than a single layer of twenty can have?
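A tiny numerical illustration of the key point (the functions are toy ones I picked, not from the slides): a piecewise linear curve with many pieces can cross a straight line once per piece, whereas two straight lines cross at most once.

```python
import numpy as np

# A zig-zag (piecewise linear) function with many pieces: a triangle wave,
# compared against a straight horizontal line at height 0.5.
xs = np.linspace(0.0, 10.0, 9999)
zigzag = np.abs((xs % 2.0) - 1.0)   # a new linear piece every unit of x: 10 pieces on [0, 10]
line = 0.5

# Count sign changes of (zigzag - line): one per crossing.
crossings = np.sum(np.diff(np.sign(zigzag - line)) != 0)
print(crossings)                     # prints 10: one crossing per linear piece
```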
And for the shallow one I know the number: that is the formula we derived, I can plug in twenty and I get the exact count. So now the question is, how do I construct something, exploiting the fact that a piecewise linear function can intersect in multiple places, to get more regions? Ideally, construct something close to what you said, some kind of exponential growth. And this is also the machinery I'm trying to get you to think about; this is the construction from the paper. It looks a bit unlikely at first, but since we have the time I want to try to go through it, because I think it's actually pretty neat, and then we can take a break afterwards, because people will probably be tired of this kind of derivation. But I really like this kind of stuff, so hopefully some of you like geometry and this sort of thing. Hopefully it's not too much of a slog, but I think it's a really nice part of this, because when these kinds of results started coming out, they made a big splash; people were very unsure what was going on. I will also say at the end why this result is, in a sense, non-informative. It's an interesting construction and everything is nice, but it turns out that the reason we use deep networks is not that they have more linear pieces; that's actually not important at all, and it's not the reason a deep network outperforms a shallow one. But for a long time people believed that was the reason, and that's what we'll talk about. Okay, so how do we do this? I'll give you an example so it becomes concrete. I will not use ReLUs, because ReLUs are a bit too fiddly here; I'm going to use the absolute value as the activation function. So the activation just removes the sign, and I'd argue any ReLU network can mimic the absolute value using two ReLUs, one for each side, and adding them up. So then, what is the construction? Say I have the plane, with coordinates x1 and x2, and I apply the absolute value to both of them. What am I doing? The two axes, where the coordinates are zero, split the plane into quadrants, so you have four quadrants. And when you apply the absolute value, you are folding the quadrants on top of each other. The point is: if I know that after this layer, call it h, I get the values 3 and 7, I don't know whether the input was (3, 7), or (-3, 7), or (3, -7), or (-3, -7). All of these points get mapped to the same thing; once I've applied the absolute value, I can't tell which one it was. That is what I mean by folding: any point in any of the other three quadrants gets mapped to a corresponding point in the first quadrant, and looking at the result I can't tell which one it came from. So now I have this function that lives only in the first quadrant, and on top of it I put another linear projection, a logistic regression if you like, and I draw its decision boundary, and say the decision boundary looks like this. The punchline is: if I take each point on that boundary and I trace it back to the input space, it can be any of those four points. So when you trace it back, you end up copying this line symmetrically into the other quadrants (the small sketch below spells out the folding step).
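A small numpy illustration of the folding step just described; the coordinates 3 and 7 are the lecturer's example, everything else (the boundary weights in particular) is made up for illustration. It also checks the remark that two ReLUs can mimic the absolute value:

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

# The remark from the lecture: |x| can be written with two ReLUs.
x = np.linspace(-3, 3, 7)
assert np.allclose(np.abs(x), relu(x) + relu(-x))

# Folding the plane: applying abs to both coordinates maps all four quadrants
# onto the first one, so these four inputs become indistinguishable.
points = np.array([[3., 7.], [-3., 7.], [3., -7.], [-3., -7.]])
print(np.abs(points))          # every row becomes [3., 7.]

# Any decision boundary drawn *after* the fold is therefore copied, mirrored,
# into all four quadrants when traced back to the input space.
def folded_score(p, w=np.array([1.0, -0.5]), b=1.0):
    return w @ np.abs(p) + b   # a linear boundary in the folded (first-quadrant) space

print([folded_score(p) for p in points])   # identical score for all four pre-images
```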
I don't know if that last step landed, so let me restate it. The whole construction is: you take the space, you fold it into the first quadrant, you use the biases to shift things back over the negative values, then you fold again, and shift, and fold again; and then, when you draw the decision boundary at the top, the number of linear pieces gets multiplied by four every time you go backwards through a fold, and the boundary becomes highly symmetric in a somewhat weird way. You get a decision boundary that looks very strange. You don't actually have to use the absolute value; you can do the same thing starting from ReLUs. I'm a bit worried this is becoming very technical, but the idea is you can split an interval into many segments by folding it segment by segment, and that is the corresponding construction for ReLUs: you handle the pieces independently, do the shifts, and then connect them. So this is a proof by construction: what you're doing is setting the weights by hand in such a way as to maximise the number of folds, and therefore the number of pieces, and it does exactly the thing I was describing; you fold the space like this and you keep doing it at every layer. Sorry, I was looking for the exact number... no, I don't have the number on the slide. This is the result from the paper: the maximum number of regions you get. It has an exponential term, and there is a product over the number of layers. This is from the paper, it's not for the class, but you can look at it and see that as you increase the number of layers this becomes considerably larger than what the shallow model can do (I've written out the bound below). And the whole point of the theorem is to show that the deep model has a lot more linear regions for the same budget. Another question? [Student: can you derive this from the previous result you mentioned?] Yes, yes, that is exactly the proof. The absolute value is just one way of constructing these piecewise linear functions that intersect multiple times; it's not the only way. People have come up with different constructions; everyone comes up with their own. I find the absolute value the most intuitive because it's easy to visualise mathematically. But all you're trying to do is come up with a scheme for defining the piecewise linear functions so that they are forced to intersect in as many places as possible, and depending on the scheme you find, you get different bounds. Some bounds are better than others. This was actually a trend for a while: people kept coming up with schemes for these constructions and improving the number. And this is the more mathy side of machine learning. So you get this number and the number is bigger; that is not the most interesting thing. The thing that I think is interesting, and maybe I'll leave you with this, is that it offers you a way to understand what a neural network does. It's not exactly correct, and I can go into why it's not exactly correct, but you can think of what the deep layers are doing as folding the space on top of itself.
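For reference, since the exact expression is in the paper rather than on the slide: the bound I believe is being referred to is the lower bound from Montufar, Pascanu, Cho and Bengio (2014), written here as I recall it, with n0 the input dimension and n1, ..., nL the hidden widths:

```latex
% Maximal number of regions of a single hidden layer with n ReLUs on n_0 inputs:
R_{\text{shallow}}(n) = \sum_{j=0}^{n_0} \binom{n}{j}

% Lower bound achieved by the folding construction for a deep network with
% L hidden layers of widths n_1, \dots, n_L (all \ge n_0):
R_{\text{deep}} \;\ge\; \left( \prod_{i=1}^{L-1} \left\lfloor \frac{n_i}{n_0} \right\rfloor^{\,n_0} \right)
\sum_{j=0}^{n_0} \binom{n_L}{j}

% The product over layers is what gives growth exponential in depth, which the
% shallow count, polynomial in n, cannot match for the same total budget of units.
```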
So when you're trying to represent a very complicated decision boundary, what you do is you find a way to fold the space, so that when I draw a linear boundary with the output layer (and it's always going to be a linear decision boundary, because the output layer is linear), when I go backwards through these folds, that line gets converted into the highly shaped decision boundary that you want. So this is, mechanically, how the deep network works: all the layers are folding the space on top of each other in weird ways, so that when you draw a line at the top and you trace it back through the folds, unfolding and replicating it in this symmetric way, you get the decision boundary you want. This is admittedly a rough picture on my part. Does that make sense? Because I think that's the interesting bit. [Student: so can we say that instead of just learning a classifier, the deep network is actually learning a representation of the data that makes it easier for the classifier to split?] Yeah, that is what's happening, and the way you learn the representation is this folding process. That's the role of the layers. In the end, yes, it's a different representation in which the problem becomes linearly separable. Do you have a question, or was it the same one? All right. So this figure, which we won't go too far into, is the one that goes with the construction I was outlining; it's not that important. Now I have a question for you, and maybe we'll take the break after this. The question is, I don't know if anyone can answer it, but it's actually easier than you think. So this is supposed to be a picture of how the space is being split by a ReLU network; the precise picture isn't important, because we're only talking about ReLU networks here. The question I'm trying to ask is: if I go along any of these directions towards infinity, so I'm going along this axis, off to infinity, from some point onwards, what is the behaviour of the network along that line? Is there something that can be easily described, something that can be cleanly said? Does anyone have an intuition? [Student: can you repeat the question?] Yeah. So I have a neural network, it maps inputs to outputs, and to make things easier let's look at one dimension, so I'm only varying one of the dimensions of the input, or moving along a line in the input space. Restricted to that line, what is the behaviour of the function as I go towards infinity? How does that function behave from some point onwards? Is there any pathology that you think is going to happen to the network? [Student: the network is going to be a straight line there.] Yeah. Because the number of regions is finite, right? It has to be that, from some point onwards, you stay on one linear piece. So the way the space gets split is into many linear regions of finite volume, plus some regions on the edges that have infinite volume and extend out to infinity. There's no way around it: you can't have a fixed number of linear regions covering an infinite space unless some of those regions have infinite volume (a quick numerical check of this follows).
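A quick numpy check of the claim, on a toy random ReLU network of my own making (nothing here is from the lecture): along any fixed ray the output eventually becomes exactly affine, because far enough out you stay inside a single linear region, which is also why a non-constant periodic function can never be matched globally.

```python
import numpy as np

rng = np.random.default_rng(0)
# A random 2-hidden-layer ReLU network, just to test the claim.
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)
w3, b3 = rng.normal(size=16), rng.normal()

def f(x):
    h1 = np.maximum(W1 @ x + b1, 0.0)
    h2 = np.maximum(W2 @ h1 + b2, 0.0)
    return w3 @ h2 + b3

direction = np.array([0.8, -0.6])          # an arbitrary ray through the origin
ts = np.linspace(50.0, 60.0, 11)           # far from the origin (other seeds may need to go further)
vals = np.array([f(t * direction) for t in ts])

# Second differences of an affine function are zero: far out, the network is a straight line.
print(np.max(np.abs(np.diff(vals, n=2))))  # ~0 up to floating-point error
```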
So the behaviour of the network has to become linear from some point onwards. [Student: and you get it from the previous result?] More or less, yes: it follows from the fact that the number of regions is finite. But, assuming we accept this answer, what does it mean in terms of expressivity? What can we not express, what will be impossible to express, with a ReLU network of any size? [Student suggests something about extrapolation.] Yeah, yeah. Just to be a bit more specific and give an example: you cannot model periodic functions. Because you have to be linear from some point onwards, you can't extrapolate them. So if you're asking about things that cannot be expressed: a ReLU network can never express a periodic function globally. You can do it on any finite interval, on a bounded domain that's perfectly fine, but you will never get it beyond that, no matter how you set the weights; it cannot be periodic, right? This is one of those kind of dry results that people tend to dismiss, like, oh, who cares. But it points towards the fact that there are some limitations. It's not mathematically deep, and it feels like a very simple limitation, some corner case you can't do, but it's real. Okay, let's take a break for ten minutes or so and then we'll go back, probably move on to the next part. [Informal discussion during the break, partly inaudible:] I mean, this is the thing about generalisation: you assume a structure and you hard-code part of that structure, but you don't know how to extract the piece you're missing. You could learn periodic features and use them as the representation of the period, and then you'd be fine. Exactly, exactly. So there's a way... So attention can be thought of as a particular kind of inductive bias. Whether it's still the right inductive bias for everything, and whether that particular choice is the correct one, I'm not sure. But definitely for high dimensions... [inaudible]. Yeah, there are alternatives; people have proposed completely different mechanisms, they all have different names, they try to find something completely different. The reason people are not jumping on them is that they don't necessarily work better, so it feels like there's no point in changing something that works for something that's merely different. Exactly. [Remainder of the exchange inaudible: something about attention collapsing and whether that limitation can be shown formally.]
[End of break; the tail of the side discussion is mostly inaudible: something about attention being able to collapse to a point, regions of infinite volume, and the classifier being linear locally.] Okay. Okay, right. So, I definitely want to go back over all these constructions. The other reason I mention them, beyond the one I gave, is that these kinds of questions were extremely important around that time; it was 2013, sorry, or maybe '14, '15, I don't remember exactly. And the reason is that we started getting things like distillation, which was going around, and maybe you've tried it. The whole idea there was that you can take a deep model and compress it into a shallow model. So there was a big question in the air: is the depth in deep architectures useful only because it helps learning somehow, but not actually needed, because you could always do with a shallow model whatever you can do with a deep model? And these kinds of constructions are trying to argue the opposite position. They're trying to argue that if you fix the size of the model... because in the limit none of these questions make sense: in the limit both of these are universal approximators, anything can approximate anything if you grow the number of units, whether shallow or deep. But when you have a real network of the size you actually use in practice, it's a real question whether I'm better off making the model shallow and wide, or making it deeper; should I do two layers or ten layers? I remember presenting the paper; I'm actually an author on that paper, so I can say this. When we were presenting it, it was at NeurIPS, I think, and it was funny, because there was another paper in parallel that was about distillation, arguing that you can take any deep model and convert it into a shallow one. It was one of those moments where people would look at one poster and then at the other: the two papers seem to say the opposite; where is the truth, if both come out at the same time? So it was an interesting period. But I think, and this is the point, it might feel technical, and obviously the proof itself, the construction, is probably not useful for you going forward. I struggle to find a point in time when someone would ask you to build a construction to show that a model can represent such-and-such. But the intuitions behind it, the machinery of how these things work, can be quite useful when it comes to questions of expressivity, to understanding what the limits of these systems are. For example, the question I asked about behaviour at infinity: it gives you a way of seeing how this plays out. And this question of periodicity is a very meaningful question to ask in practice. You can say, I have some biological signal, or some signal I'm modelling, and the signal itself is periodic, and you're training a transformer on it, and somehow it doesn't generalise out of domain the way you want it to.
And there's a fundamental reason for it. You don't even need to run the experiments to see that it happens: if you have the right intuition about how the system works, you know beforehand that there's no point in even trying, right? There is a lot of this kind of thing going on, and there's going to be another popular example that we'll go over at some point, with softmax, where anyone who looks at the formula of softmax carefully enough should see it. Maybe I can ask the question now; I think you know about softmax, we'll talk about it properly later. What happens if you apply softmax to an arbitrarily long vector of bounded values? If I have a vector and I keep increasing its length, keep adding entries, and in the limit the length goes to infinity, and I normalise it with softmax, what happens? [Student: it approaches a normal distribution?] It will actually go to uniform, not normal. You're right that something degenerates, but what happens is that the weights become uninformative: the distribution goes to uniform. Why is this important? It came up when people were discussing LLMs and increasing the context size. You train an LLM on some context length and you want to run it with a hundred times bigger context. What happens as you increase the context is that the attention weights become uniform, and then the attention breaks down, because if the attention weights are uniform there's no point in having the attention there at all, and the whole model breaks down. It's one of those things where, maybe I'm making it sound too trivial; it's not trivial, there was a paper about it, and that's the paper I sent you, so it's not like people just shrug at it in the corridors. But it is one of those things that really comes down to looking at how the formula reacts as you change some variable and take it to the limit. Okay, so let me give some detail about why this happens, because I was trying to make a point quickly and being a bit sloppy. The issue is not just that you have a large number of entries; that alone is not enough. The other thing that's needed is that the logits are finite; they're not allowed to blow up. And this is true in practice because in a neural network we have finite weights, and the way networks are built, all the intermediate values are finite by necessity; once you get infinities you get NaNs and all sorts of broken stuff. So the point is everything is finite, and therefore none of the probabilities are ever going to be exactly zero: for a softmax to put zero probability on one of the entries, you need a logit of minus infinity; that's just how the normalisation works. Because of that, as you add more and more elements, there is probability mass that has to be shared among all of them, and as you spread that mass, the peak gets shorter and shorter and everything goes towards uniform (a quick numerical check follows).
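A tiny numpy experiment showing the effect just described; the sizes and the logit bound of 5 are my own choices, not from the lecture. With bounded logits, every individual softmax weight is squeezed towards the 1/n rate as the vector grows, so no entry can keep a meaningful share of the mass, which is what "the attention becomes uniform" refers to here:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
for n in [16, 256, 4096, 65536]:
    logits = rng.uniform(-5.0, 5.0, size=n)   # bounded logits, as in a real network
    p = softmax(logits)
    # p.max() shrinks roughly like const / n: the peak flattens out as the "context" grows.
    print(n, p.max(), p.max() * n)
```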
There's no way around it, because you have an unboundedly growing number of entries, each of which has to receive some probability, so everything ends up close to uniform. That's the machinery going on there. You will revisit this much later when you talk about transformers and so on, but I wanted to make the point that this kind of mathy reasoning might look scary and boring, and yet there's real value in it. That's why I wanted to go through the whole thing. [Student: does this kind of reasoning apply to agents, to agentic systems, before running experiments?] Yeah. So, we're in a very strange time. These systems work surprisingly well for some things and then fail catastrophically for others, and understanding where that boundary is, is hard, but it's probably the most important question right now. A lot of these agentic behaviours put the systems in situations where they're really operating out of domain, and maybe that's the key component of it. What happens with neural networks, at least in terms of expressivity and their ability to do things, is that as long as they stay in the domain where you have sufficient data, they tend to be fine. This whole periodicity thing, and the softmax thing, are both scenarios where you push the system out of domain. So the question for the agent scenario, or the multi-agent scenario if you want (at least in terms of formalism I feel a multi-agent setup is a lot better defined than the rather vague term "agentic"), is whether the interactions between multiple agents drive any of the components out of distribution. Technically, they can easily do that. In practice, at least in how these systems have been used so far, it seems it's not as big a problem as you might expect, at least for the common scenarios. I should know more about this; I've mostly used these setups for things like generating data, and I've played with it a little. As far as I've seen it's been okay; there wasn't some new kind of failure mode. But it really comes down to the rules of interaction between the agents and whether those can produce dynamics that push the system out of domain, into a regime it doesn't know. That's where the key is, and I don't know whether that's easy to figure out or not; but that's how I see that whole space of what people are doing. It's also the case, for LLMs, that it's hard to understand what's out of domain and what's in domain, even for very pragmatic reasons: the data is always secret, so we don't know what any of these models has been trained on. And all of that makes the whole thing even worse, because it's really hard to even start building testing scenarios unless you know the secret sauce of how they were built, and no big company will tell you exactly what went into training. That's the hard part.
I don't know what Gemini is trained on, because I'm not part of the Gemini team, and even within the Gemini team there are teams within teams; at the end of the day you know about the little piece you're working on but have no idea of the whole picture and where the data is coming from. It's a bit of a big operation. But anyway, I just wanted to use that to justify all this. I will probably have one of the homeworks be built around this kind of math game, probably more than one, so hopefully it's not as scary as it looks. But going back to the point I was making: there were these two papers seemingly working against each other, right? There were the people who could compress deep models into shallow ones, and there were the theorems about what shallow models of a fixed size cannot represent. And the question is: how can both be true? And the answer is: is expressivity even the reason we do deep learning? Does a shallow model underperform because of lack of capacity? Obviously I can't give you a mathematical answer to this, but you can give empirical evidence towards it, and the empirical evidence (there's a paper about it, and there'll be a slide later) is that it was never an issue of capacity. The empirical evidence is, for example, that there are papers that tried to see whether we can memorise ImageNet. You take ImageNet, about a million images, and you replace all the labels with random numbers, so there is no structure left that you could use to learn the labels, and you see whether you can fit it. And you can fit it perfectly with the usual models. People usually take this as a sign that the model has enough capacity to memorise the entire set; it does not need to learn any kind of generalising function. But then, when you train on the real labels, it still learns something that generalises; it doesn't just memorise, right? And there are other results like this, where you try to measure how much information, how many random bits, a model can store, and you find that the number is actually surprisingly high, even for shallow models. You can kind of see this from the formulas as well, although it's hard to be precise: if you plug into those formulas some realistic numbers for the number of hidden units and so on, you get some truly astronomical numbers. Obviously it's hard to reason about how many linear regions you actually need to represent CIFAR; I don't know what that number is. But the number of regions you get is ridiculously large; numbers that make no sense, more than the number of atoms in the universe, things like that. So it's not a capacity issue. There is something else, and that's what these results are meant to show, in a rather indirect way. There is something that comes out of depth, because of how the machinery works: you have this folding of the space, then you draw a line, and your line gets replicated over all the regions that are folded on top of each other, so you get a decision boundary that is highly symmetric.
So what I was trying to do here on the slide is to say: I take this square, I draw some lines in it, and I unfold it, assuming I had folded it in the grid-like fashion I was describing before, and I get a shape like this, which is very highly symmetric. The intuition, and this is even in the original paper that we did, we note it in the discussion at the end, is that it's actually not the number of regions that matters. What matters in a deep architecture is that the slopes of these linear regions are tied to each other, they are symmetric copies of each other, and that comes from the structure of the layers. [Student asks about the slopes of the lines.] Yes: if this line goes like this, that one has to go like that, because they are replications of each other along the fold. So there are these very strong symmetries between all of these linear regions, which come from how you fold the space. Obviously it's not literally going to look like this; this assumes a very chessboard-like folding of the space into nice squares, and that's not what your network actually does. But you do get highly symmetric regions. And the intuition that evolved over time, and it's not that it got resolved, because people don't look at this any more the way they did ten years ago in terms of papers, the intuition that emerged from all of this is that it's the symmetry that matters, not the number. The issue is that we don't know how to describe these symmetries mathematically. Counting is easy. But saying "the regions have this and this structure" is something we don't have the mathematical apparatus for, and that's why the whole direction kind of stalled. [Student: I understand that there are patterns, but why does that matter? Why does it make computation, or generalisation, more efficient?] So, and hopefully this makes sense: it's about high-dimensional data. It seems that for the natural data we care about, the decision boundaries in natural data are themselves highly symmetric somehow. So this acts as an inductive bias: it biases you towards learning a particular style of decision boundary, one that matches the boundaries found in real data. There was a point, and I don't know how true it still is, but it was definitely true if you looked at generalisation more than anything else, where if you tried to train neural networks on synthetic data that was randomly generated, you found you had a hard time doing it. It used to be the case, people have since figured out some tricks that make things easier, but it used to be that if you tried to train a neural network on very low-dimensional data, two-dimensional say, you usually had a hard time fitting it. Other methods tend to work better the lower-dimensional the input is, because in higher dimensions you hit the curse of dimensionality and all of those issues. Neural networks are almost the opposite: they prefer high-dimensional data; the higher-dimensional the data, the better. And that's probably because, with natural data, there is some property of the decision boundaries that gets exploited.
You know how, for spheres, as you increase the number of dimensions all the volume concentrates near the surface. There's probably something like that happening with decision boundaries as well: as you go higher-dimensional with natural data, something happens to the decision boundaries that the neural network somehow knows how to exploit. I don't have a theory or papers to point to for this, but it is an empirically observed fact that neural networks work very well on high-dimensional natural data. So this is why people think there is an inductive bias in the architecture, one that we don't know how to state exactly in words, but an inductive bias that matches natural data much better than other model classes. Now, and I think people already know some of this, let me talk about saliency maps and filter visualisation, partly to make this point about how people describe this inductive bias at a high level. Do people know what saliency maps are? Some of you do. So a saliency map looks like this: you take an image and you colour the pixels according to how much they influence the decision. What saliency maps really are is sensitivity analysis. What you're asking is: if I have this image and I change this pixel here a little, does my classifier change its answer? You look at which pixels the prediction is most sensitive to, you colour the pixels according to their sensitivity, and you get something like this. And then you use it to say: it says it's a laptop because it's looking at the Apple logo, and it says it's a cat because it's looking at the face and the tail, or whatever the story is. This is quite popular; refined versions of it are used in explainability, with all kinds of tricks added, but at its core it is a sensitivity analysis method, that's what it is. And it works by looking at how the gradient information flows back to the input. The way I think of it, you literally take the function and do a Taylor expansion; that's what it is in my head. You say that my function at the input plus a perturbation, where the perturbation is very small, is approximately my function at the input plus a linear term: f(x + eps) is roughly f(x) + grad f(x) . eps. So if you want to compute this map, you just compute the derivative of the output with respect to the input, and that is your sensitivity; then you normalise it and map it into colours or whatever you like. So the gradient magnitude gives the measure of importance (a minimal version of this is sketched below). Now: is there any sense in which this is wrong? What can go wrong if people just rely on gradients to make this kind of judgement? Does anyone have an intuition about why this can be problematic? [Student answer about the expansion.] Yes, let me rephrase; I think this is what you mean. First of all, you're dropping some terms here, and usually you justify dropping those terms because your epsilons are very small. But you are dropping the higher-order terms, and depending on those higher-order terms, even if epsilon is very small they might still be very meaningful, if you have very high-norm higher-order terms, large curvature and so forth. So there is a way in which this approximation is somewhat drastic.
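A minimal numpy version of the gradient saliency just described, for a toy logistic model (the weights and inputs are made up; real pipelines use autodiff on a full network, but the idea is exactly the first-order term of the Taylor expansion):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy "classifier": a logistic model over a 4-pixel input (made-up weights).
w = np.array([3.0, -1.0, 0.2, 0.0])
x = np.array([0.8, 0.1, 0.5, 0.9])

def predict(x):
    return sigmoid(w @ x)

# Gradient saliency = |d f / d x|, available here in closed form via the chain rule.
p = predict(x)
saliency = np.abs(p * (1 - p) * w)      # sigmoid'(w.x) * w
saliency = saliency / saliency.max()    # normalise before mapping to colours
print(saliency)                         # pixel 0 dominates, pixel 3 looks "irrelevant"
```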
So, the example; I guess you all know what sentiment analysis is, right? Suppose you have a very, very simple sentiment model with a hidden sigmoid unit in it. There's some feature engineering happening under the hood, and one of the hidden units fires when the word "awesome" is in the text, through a sigmoid. But it is saturated: the weights, whether set by hand or learned, are something like 100, so when the word "awesome" is there, that unit goes into a very saturated regime of the sigmoid. Now, your decision might still be based on that unit; it depends on the fact that the word "awesome" is there. But when you compute saliency maps, because the sigmoid is saturated, there is not going to be any gradient flowing through it. If you take the derivative of the sigmoid, you'll see that the gradient vanishes in the saturated regime. So when you do this kind of colouring of the different words in the review, the word "awesome" is not going to be highlighted. It won't light up, so the map claims the model does not depend on that word, and that's only because the saturation kills the gradient. Put differently, some things can act like a bias term in your decision, in a way that they don't locally have a gradient: changing them a little bit does not change your prediction, because the unit is saturated, but they are important for your decision. The fact that they are part of the reason for the decision doesn't require this kind of local sensitivity. I don't know if that example is clear to everyone, but there is a sense in which this goes wrong. Right, so I'm basically arguing that a model can rely strongly on a feature when making its prediction while the gradient with respect to that feature is essentially zero. Okay, just to make it concrete: you have the sigmoid that is saturated; because the sigmoid is saturated, its derivative is close to zero; so the gradient, the saliency, for that feature will be close to zero. But the prediction is being driven by that feature. It's easier if I give you numbers. The model could depend only on that particular feature, if you want. I can give you specific values for the weights so you see how it works out; maybe I'll write it up as a note and send it to everyone later today. But let me try. Take a single saturated unit: h = sigmoid(10 times the feature), whatever the feature is, and then the output is, say, sigmoid of that, or even just one times it, right? And the feature is 0 or 1: 0 if the word "awesome" is not in the text, 1 if it is. So now you have a text that contains "awesome". Because the text contains the word, the feature is 1, the pre-activation is 10, and the sigmoid of 10 is essentially 1; it's in a very saturated regime. So now, if I try to compute the saliency map, I have my output and I want the gradient of the output. The output is above 0.5, close to 1; the exact value doesn't matter, the point is the prediction is positive.
You try to push a gradient through this. The gradient from the output flows to h just fine: the derivative of the output with respect to h is greater than zero, and I can set my weights so that it is. But the problem is the derivative of h with respect to x, in my case with respect to the feature: the magnitude of that term is essentially zero, because my hidden unit is extremely saturated, and that is just a property of the sigmoid. When you compute gradients through a sigmoid, the more confident, the more saturated, the sigmoid is, the smaller the gradient, because the derivative of the sigmoid is sigma times one minus sigma, something like that. So if your sigmoid output is basically one, then one minus one is zero, so this gradient is zero. So when I apply the chain rule to compute the gradient of the output with respect to the feature, to see whether it was a factor in my decision, I get that the saliency is zero. But that feature is the only reason the model made the decision, right? If you look at this very simple network, if you remove the feature, set it to zero, the prediction changes. So let me say it differently; maybe this is the way I should have said it from the start. Because of saturating units and similar effects, the function you're modelling can be effectively discontinuous at some points, or behave almost like a discrete function, and at those kinds of points the gradients don't mean anything any more; they don't make sense. That's what's happening here: by forcing this unit into a saturating state I'm making it behave almost like a step, so I break its continuous behaviour. And I think it's the same thing as the Taylor-expansion argument: it's equivalent to having very high-norm higher-order terms; in the extreme, it is exactly that. I just wanted to give it as a more concrete thing that actually happens (the numbers are spelled out in the sketch below). You can have this kind of behaviour with other saturating activation functions too. Softmax is a good example with essentially the same issue: when you're attending to a particular token very confidently, so all your attention weight is on that one token, then the gradients don't flow through the attention. So if you ask, using a saliency map, what the model was attending to, you will not be able to tell. You have to look at the attention weights themselves rather than at the derivative through the attention; that's the difference. If you look at the attention weights, you can see that the weight was one on this particular token, so that's the one it was attending to; but if you look at the derivative of the output with respect to that token, you'll get essentially zero, and you would mistakenly conclude it wasn't attending to it. So that's the failure mode. Is that clear, does it make sense? [Student: I can follow it in the context of sigmoids and the like, but what about ReLU? Wouldn't it work fine there?] Well, for example, and I'm trying in my head to see how to build an example of this, you can have a similar situation where the decision relies on a unit being zero.
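The worked example above, written out with concrete numbers; the weight of 10 and the 0/1 "awesome" feature follow the lecture, the output-layer weights are my own filling-in. The prediction depends entirely on the feature, yet its gradient saliency is nearly zero (with a weight of 100, as also mentioned, it would vanish to machine precision):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def model(feature):                        # feature = 1 if "awesome" is in the text, else 0
    h = sigmoid(10.0 * feature)            # hidden unit, heavily saturated when feature = 1
    return sigmoid(5.0 * h - 2.5)          # output layer (weights chosen for illustration)

# The prediction clearly depends on the feature:
print(model(1.0), model(0.0))              # ~0.92 vs 0.5: positive only because of "awesome"

# But the gradient-based saliency at feature = 1 is essentially zero:
f = 1.0
h = sigmoid(10.0 * f)
out = model(f)
dout_dh = out * (1 - out) * 5.0            # gradient through the output layer: perfectly fine
dh_df = h * (1 - h) * 10.0                 # gradient through the saturated sigmoid: ~4.5e-4
print(dout_dh * dh_df)                     # chain rule -> near-zero saliency for the decisive feature
```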
And that is ReLU's saturating state, and you get the same kind of issue. You have to play with the biases and so on to make it work, but I'm fairly sure you can construct a case where the model looks at a hidden layer and makes its decision based on the fact that nothing is positive, that everything is zero, by multiplying negative numbers and doing something with the bias and so forth. And when you have a situation like that, again, if you look at the gradients, because everything is zero nothing flows, so you would say "I do not depend on anything below", when actually that is the only thing you are depending on. So this is a problem not only with this particular nonlinearity; it comes from nonlinearity and saturation in general. Obviously you can't use this argument to say that these models, or these tools, are useless overall. I'm probably making a bigger thing of this than it needs to be; I'm just pointing it out because I know that in the interpretability world people use saliency maps left and right, and there are other places where people use tools like this, and because you get used to them so much, you start treating them as ground truth; you assume they can never go wrong. But sometimes they can go wrong, for simple reasons like this, and it's one of those instances where something goes wrong and you can't figure out why, because there's something simple going on behind it. That's why I'm bringing it up: in general, you should always be a bit suspicious of anything when it comes to neural networks. For any kind of story, any kind of theory, there is always an angle where the assumptions unravel or something breaks down. And this was just for saliency maps; it's one example that I think can be pathological, and I don't have papers at hand, but I'm pretty sure you can find papers that exploit this and show how you can be led to the wrong conclusion from the picture. There is another view, less popular nowadays; these are terms I made up, so I don't know whether they make sense to anyone else, but I was calling the previous one the backward-propagation view of interpretability, and this one the forward-propagation view, which is actually how people started when they were doing this kind of thing. And it's exactly how we built, or tried to build, that example with the sentiment regression. The way people thought about things is that each unit is some kind of feature detector. Back around 2009, when I was doing my PhD and we were working with RBMs, we had this perspective that every unit has a semantics: it's a feature detector. It detects some particular feature, and the way it works is that the unit computes some kind of similarity between a pattern, which is its weight vector W, and the input x. The higher the dot product, the more the input matches that pattern; the lower, the less similar. And then the activation function is some kind of normalisation: a squashing if it's a sigmoid, or a thresholding if it's a ReLU, but either way it preserves the ordering of the similarity; big is a good match, small is a bad one. And a unit fires when the similarity is maximised. That's the kind of picture.
And then, if you have multiple of these things stacked together, you can use a sort of reconstruction pipeline to try to figure out what a unit means in the input space. You're basically inverting: even though these operations are not really invertible, you assume some approximate inverse of the previous layers to figure out what feature the unit is detecting. I'm bringing this up because it's another classic kind of visualisation; maybe you've seen it, it's a bit old-fashioned now: visualising the filters in a convolutional neural network. What you do here is visualise the weights rather than the gradients, and when you go to deeper layers you basically multiply the weights together and make some assumptions about the activation functions, which I won't go into. These are used, again, to assign semantics to units and to describe what the model is doing. The first layer is the easiest to do, and it's also the most reliable. Here you can say, oh, this one looks like a Gabor filter; this one is a filter for the colour green, so I have a unit responsible for detecting green in the image; this one detects an edge, a line like this; this one responds to a line like that. And then you keep composing them and you get these shapes, something that looks like wheels, or some weird texture pattern, and so forth. Okay, so this was a long way around, but what I really wanted to get at is this inductive bias. With both of these kinds of visualisation schemes, people started trying to look at what the different layers detect in a deep model. This is really early stuff, around 2013, and obviously there is a bias here as well: people knew what they wanted to see, and when they were doing these papers they knew exactly what they wanted to get out of them. But what they see is that, as you go up from the first layer, you get parts of objects, like eyes and noses, and then parts of faces and whole objects and so forth. And then there is this kind of visualisation, this one is work from Yoshua's group, where they take two images, say a 9 and a 3 that are correctly classified as a 9 and a 3, and they interpolate between them in input space, then in the space of the first hidden layer, then in the space of the second hidden layer, and so forth. And the argument is that the role of the neural network, its inductive bias, is that it builds hierarchical representations. This hierarchy echoes the standard neuroscience picture: first layer edge and blob detectors, then parts of objects, then more complex objects built up from those. So the intuition is that this folding of the space that we described basically makes the space semantic: any direction you move in corresponds to some kind of semantic change in your data. That's the intuition. And then there is this whole position on abstraction and hierarchy; maybe I even have a slide on it, because this is not so much machine learning as general philosophy. If you look at the foundations of deep learning, I'm thinking of the review papers from around 2010 and so forth, from when the field was just getting started,
there was this strong assumption, coming from cognitive science, that you cannot do reasoning, you cannot do any kind of interesting processing of the data, unless you have abstraction. This is at the core of how people think biological systems work as well: you take the raw data and build more and more abstract representations of it, and that is what allows you to do the interesting things with the data. There are levels of abstraction, spatial abstractions, a hierarchical structure being built up through this. And if you look at the early theses of what deep learning should be, they were very explicit about this: they said that for something to be called deep learning it needs to be deep, because it needs hierarchical representations that build on the representations below them, and it needs this abstraction being built up into more and more interesting things as you go up. And this framing was used for many years. Even nowadays, if you ask people, many would argue that that's why the networks work: because abstractions are forming as you go up the model. Maybe something like that is happening. But, and this is a big reason I'm mentioning it here, and I'll come back to it later as well, there are reasons why I don't know whether this is actually true. So I'm just going to throw it out there and return to it later. There are a few things that have changed over the years, things that happen in modern architectures, that kind of break this perspective. Does anyone have a guess? The most basic one, which is crucial for any modern architecture, is skip connections. We'll get to what skip connections are, but a skip connection basically allows you to skip: instead of going layer by layer, the input can be routed directly to deeper layers. Because of skip connections, the representations are not hierarchical in the way they used to be, because the data flow is different now; the data doesn't flow only one way through the stack. The other common example, and this one is a lot more debatable, and I'll get to it eventually, is diffusion models. If you think about a diffusion model, it's not clear that it builds abstractions at all: it basically goes through noisy versions of the data, and somehow, from noise to less noise to less noise, suddenly something comes out of it; it doesn't look like there's any hierarchy of abstraction in there. I mean, this is a bit hand-wavy and it's arguable, but I just wanted to put it out there in case anyone finds this perspective interesting. This was the basic theme of representation learning, and ICLR, which became one of the main conferences, the International Conference on Learning Representations, started by Yoshua and Yann and others, was named that way really because of this kind of thinking. It's not clear to me whether the representations we're learning nowadays are the same kind of representations we were learning ten years ago, and whether some of the theses people put forward about what a good representation is are actually still valid. On the cognitive-science side, if you look at the literature, it's not that established either: the majority of people will say that intelligence cannot exist without abstractions and hierarchical representations, but there are books and papers that argue that's not true.
And I think the octopus is usually given as the example of something that may not have the right kind of hierarchical abstraction in its nervous system and yet still shows intelligence, though don't quote me on the details. Anyway, it's an interesting, somewhat philosophical aside, and I realise we're going to run out of time. The other thing, and I touched on this earlier today, that I want to say about this kind of semantic assignment to units: one has to be careful, and the usual way you can show there is a problem with it is by going to adversarial examples. This is what adversarial examples are: you take a picture of a panda, the model is reasonably confident it's a panda; you add some noise to the picture that you can barely see, and the model classifies it as something completely different. And these are some more modern examples, where this one now gets called a bathtub, that one a bird, and the 4 becomes a 6, right? The way you fool the model is with these very small perturbations that you can't really see. So, adversarial examples. I don't think we fully understand them mathematically. One thesis is that adversarial examples exist because the system is essentially linear. The intuition is that you have a very high-dimensional input space, and you can distribute your noise over all of those dimensions. Because the model goes from a very high-dimensional input to a very low-dimensional output, ten classes or whatever, you can distribute the perturbation so that each coordinate changes imperceptibly, yet once it gets concentrated into that much smaller-dimensional output it makes a big difference. You spread it over the, I don't know, 10,000 or 100,000 dimensions you have in the input, in a way that is not visible to the eye. And the way this works out is because the system is roughly linear: whatever small noise you distribute over the input gets carried through the layers of the model to the output more or less linearly (a back-of-the-envelope sketch of this argument follows). We might come back to this later, once we've established more of the basics about neural networks, but this is not the only proposed reason for adversarial examples; there are papers that argue exactly the opposite, that they arise because the system is strongly nonlinear rather than linear. Still, in some sense the linearity story is one of the established ways of thinking about it. The reason I brought them up now is this: there are two ways adversarial examples connect to the discussion so far. One is that, using adversarial examples, you can show that units which are meant to respond to some semantic pattern, say a unit that detects an eye, can be made to respond to something very different. That's just a simple attack, right? Instead of attacking the output of the classifier, you attack the activation of the unit. We haven't covered the mechanics of how you find this noise, whether by gradient descent or something else, but you can run the attack so that you change the input in such a way that it still does not contain an eye, and yet the unit "detects" an eye.
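A back-of-the-envelope numpy sketch of the linearity argument (everything here is a toy setup of my own, not an attack on a real trained model): a per-coordinate perturbation of size eps, aligned with the signs of the weights, shifts a linear score by eps times the L1 norm of the weights, which grows with the input dimension even though no single coordinate changes noticeably.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01                                  # imperceptible per-pixel change

for d in [100, 10_000, 100_000]:            # input dimensionality
    w = rng.normal(size=d)                  # stand-in for the model's locally linear behaviour
    delta = eps * np.sign(w)                # FGSM-style direction: tiny everywhere, aligned with w
    shift = w @ delta                       # change in the score = eps * ||w||_1
    print(d, shift)                         # grows roughly linearly with d
```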
And the truth of how this works is that there is always a way to do it. Basically, I'm trying to make the activations of the model whatever I want, and as long as the model is continuous, you will always be able to attack it. Maybe defences force the norm of the perturbation to be higher, or change things like that, but you will always be able to attack it. And the reason this is important is that I want to use it as a warning about people assigning semantics to units, being very drawn to that, and then reframing the whole model in terms of: oh, I have this eye detector, this nose detector, and this mouth detector, and once these little detectors fire, the next layer will say this is a face, and it's done, and this is how the system works. That's not really how this system works. Those units are not eye or nose detectors, and it's not about composing them, right? And the other attack that is relevant here is this interesting paper where they were trying a different kind of attack. They take an image, say of a llama. They want to leave the model classifying the image as a llama, but they want to change what the model is looking at when it makes the classification. So initially the model is looking at the face of the llama and classifies it as a llama because of that. After the attack, the model is looking at a region of the input where there is no llama, just background, and it detects the llama because of that. And here, I think, this is the butterfly, and after the attack the model is looking at this corner of the image. So again, these attacks just say that just because a model makes a judgement based on something you can interpret in one case, it doesn't mean that for another similar input it's going to make a similar kind of judgement. So this is basically a warning about generalising how models work. In terms of interpretability, for a given input and a given output, you can do all this kind of stuff, saliency and similar analyses and whatnot, and come up with maybe an explanation of why, for this particular input, you get this particular output. But generalising from that to saying it will do the same thing for other similar inputs and outputs is wrong. You can have two images like these: you can't tell them apart, they look exactly the same, the predicted class is the same, and yet the reasoning is completely different. The same for the two llamas again. So this is just meant to give you the sense that a lot of what people claim here about interpretability you shouldn't trust too much. Question: for the llama, is it possible to identify which one was the original image? Yeah, so this is the original one. And can you go backwards, attack it in reverse? Yeah, I think so. I don't know exactly how that works; I think it's a gradient-based kind of thing, but I think you can iteratively change it any way you want. And then, usually, these attacks work better the higher the dimensionality of the input data is. Why? Well, it's the same norm argument: you have an input dimensionality that is much, much higher,
so you can distribute your noise over those dimensions in much smaller quantities and still have it come together into an attack vector that is large enough overall. If you're in a very low-dimensional space, it's much harder to make attacks that humans cannot perceive, ones like the panda attack, where if you showed me the two images as a person I would say they're the same; I can't even tell what the noise is. Obviously, for breaking the semantics of a unit, it doesn't need to be the case that you cannot notice the attack. And maybe the scarier thing about neural networks, I don't have a slide on it, is that you can give an input that is just noise and the model can be confident that it's a panda or something, right? The network can be very confident on images that are just Gaussian noise. So that is telling you, like, okay, there are no semantics there: it gets confused by random pixels into detecting the object it was supposed to semantically detect. We need to accept that that's not what it's doing. That said, obviously, in some sense we're in that weird space where you can't really reason about, improve, research, or play with architectures unless you make a simplified model of them in your mind that allows you to form hypotheses and do things. So I'm not saying you shouldn't assign semantics to the components of a neural network and assume that they work that way. I'm saying that, as a researcher, you should do that in order to come up with hypotheses, experiments, and proposals, and so forth. But you should also be aware that all of these assumptions are wrong to some degree. You're walking this fine line of knowing that your assumptions will never be fully true, while at the same time there's nothing else you can do but assume they're true and work with them to come up with something new. I think the danger is when people become so used to the assumptions they're making that they don't realise they can be wrong, and then they stretch the conclusions beyond what they should be. That's basically what I'm trying to convey here, more or less. Okay, this next one is not that easy, but I'm going to put it out there anyway. So the question is, can you... okay, let's ignore the details, because this is going to be a very hard problem; let me make the question more generic. Assuming there is some kind of control over the model size, the question is basically whether a shallow model can represent whatever a deep model can, and, vice versa, whether the deep model can represent anything the shallow model can. Is one of these a subclass of the other, or how do they compare in terms of the set of functions they can represent? This kind of goes back to the linear regions discussion. There has to be a limitation: they have to be controlled for the number of weights, because otherwise it becomes a trivial question; with unbounded size it is simple. But just at a high level, ignoring exactly how you control for the weights: does anyone have an intuition? If not, we can pick it up next time. I can wait a bit more. Any thoughts? Or any guesses?
So, can A represent whatever B can represent? Do we have infinite neurons? No, no, the number of neurons stays the same; that's the name of the game here. So take a guess; there's no wrong answer. I mean, there is a wrong answer, but still. So, I believe the shallow one cannot realise what the deep one can. That is the correct answer, yeah. And that comes because we showed that a deep model can have exponentially many linear regions, so because of that, the shallow one cannot represent it, right? The deep model can do things the shallow one cannot. But is the reverse true: can the deep model represent any function that the shallow one can? Yes? That's the natural answer, and it's not true, because the way the deep model gets more linear regions is through symmetries, regions that are tied together, while here there are no symmetries between the regions of the shallow model. And if you want to check the numbers, maybe this is an interesting question; I haven't thought it through myself, but it would be an interesting exercise: if you count the number of unrestricted linear regions that a deep model can produce, it's going to be way smaller than for the shallow model, because basically you can only do that freely in the first layer, and the later layers don't add much. Or something like that. Sorry? Sometimes you need the deeper layers to behave linearly. Yeah, yeah. So the way you do this, so that you don't introduce symmetries, is you basically make the deeper units always positive, so they effectively disappear; they all become linear. So then you end up with a shallow model that has fewer hidden units than the other one, and therefore it can't reproduce it. So I think this is the reason I brought this up. It's kind of interesting, because people usually have this mental picture that the deep model is strictly better than the shallow one, that a deep model can do whatever a shallow one can do. But it's not actually true. If you don't control for size, in the limit, they can all do the same thing, right? But if you control for size, there are things a shallow model can do that a deep one cannot, and I think that's an interesting observation. And the deep one can also do things the shallow one cannot? Exactly. They're not nested; they're different. They have some intersection, but each also has things that the other one cannot do. So I think that is an interesting observation, and, I don't know, for me it feels like the folding that happens in the deep model matters when you're making decisions about these things. With that I'm going to stop today. Exactly on time.
LECTURE 3:
architecture itself, but also the optimizer, will reshape the search space in which you're looking for a solution. So let me try to express this slightly differently, so it makes a bit more sense. I'm going to say something that's not true, but just to give you an idea.
Say, for example, that deep models like sparse solutions, just because of their structure. It's not true, but let's use sparsity as an example, right? The idea of the inductive bias would be: if I train a deep model, I usually get a sparse solution; if I train a shallow one,
I'll get a dense solution. That would be the inductive bias; that's the difference between them. Now, what is the actual inductive bias of a deep model? What kind of solutions will you get? If you believe the whole cognitive science story that I was presenting at some point, which is partially true but not completely true,
it is that a deep model will find solutions that are more compositional, because they have this hierarchical structure. So the way it learns, say, to classify your cat is: the first layer learns Gabor filters, then it learns to compose those Gabor filters into parts of objects, and it keeps going like that, so it will only converge to solutions
that respect this kind of compositional structure. The solution itself decomposes things, while a shallow model doesn't do that. The shallow model will try to find a template of a cat in its first layer and just match your image to that template, because it does not have the depth to compose things.
There's no decomposition going on in there. So the kinds of solutions you find are fundamentally different. They might give you the same numbers, but they're different, and the ones the deep model finds, because they rely on this compositional structure, generalize better. So when you get a new picture, it works a bit better.
That is kind of the argument, and this compositional structure of the solution, you could argue, is the inductive bias of a deep architecture compared to a shallow one.
Let me come closer to make sure I hear you. Right, that's a good question. So, okay: distillation is a big field, and people have tried all kinds of things. The standard thing people do is they put a KL divergence between the output of the teacher and the output of the student, so it looks at the whole distribution, not just the top label.
It matches the distributions: the teacher outputs a distribution over the classes, and you match that distribution in the student. That is the standard distillation process, but you can do all kinds of things. There are all kinds of variants; you can play with the temperature of that distribution.
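As a rough sketch, the standard loss being described might look like this in code (names and the temperature value are assumptions, not from the lecture):

```python
# A KL divergence between the softened teacher and student distributions.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # KL(teacher || student); the T**2 factor keeps gradient scales comparable.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T**2
```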
You can do lots of tricks; some help, some don't. But typically you are not using the teacher to classify an image and then using that as a hard label for the student, because then, well, it's still not the same as learning on the data, although at least you make the same mistakes that the teacher does.
Which is actually quite informative. Why is it informative? Okay, even that will help you, and I can explain it a little bit. Neural networks have a tendency to learn smooth functions; they don't like fast transitions. That's just by the construction of the neural network, and by the regularization that we typically use:
we converge to low-norm solutions, low-magnitude weights, and therefore to functions that change very slowly. So if you have very nearby images that have completely different labels, that would be high frequency; that would look weird. A teacher would not learn that; it would mislabel
one of the two images. And that actually makes life easier for the student. Because, in some sense, the teacher is even more constrained than the student, due to its architectural structure, and the student might have the ability to more easily learn the label noise, to overfit to the noise.
Now, this might not be noise; it might actually be the true underlying function. But in some sense, when you use the teacher, the labels coming from the teacher make more sense for a smooth function, so it makes life easier for the student.
So you still gain something, but that's not what people do in practice. In practice, people really minimize the KL between the output distribution of the teacher and the student, and then you have the logits, so you have all the other information in there. Question: does it mean that if I have human annotators, and these annotators can make soft labels for the data set,
like, for example, the famous images of cookies and dogs, can I make the labels intentionally more confusing to get these soft labels? Yeah, so you can definitely, and there are techniques to do this that help, not as much as distillation, but they help, and some of them are super silly.
So one thing you can do is: instead of having a one-hot target that says this is the class, you put, say, 0.9 on the class and spread the rest uniformly, and this already helps learning. You can do fancier stuff. For example, for ImageNet, you know that there is a hierarchy over the classes, over the labels.
There is structure, so you can use that to decide how to allocate the uncertainty, right? You can say everything within the same kind of category gets a bit more mass, and the rest gets almost zero. So you can do these kinds of tricks, and they help a little bit.
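A minimal sketch of the uniform version of this trick (a hypothetical helper; shapes and the 0.9 value are assumptions):

```python
import torch

def smooth_labels(class_ids, num_classes, on_value=0.9):
    # Put `on_value` on the true class and spread the rest uniformly.
    off_value = (1.0 - on_value) / (num_classes - 1)
    targets = torch.full((class_ids.shape[0], num_classes), off_value)
    targets[torch.arange(class_ids.shape[0]), class_ids] = on_value
    return targets
```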
I don't have them in the slides, but these are tricks people use in vision.
It doesn't solve all problems, but it does help; people were using them quite a bit. I don't think they're used as much anymore. Right now, people rely on data augmentation, which helps a lot more, and this label smoothing feels like extra work, so people don't do it.
Well, that's pretty good. I had some questions as well; I'll try to go through them a little bit, because I wasn't sure whether people would have questions, and mine might be more technical, but let's see. So, I think I've said this a few times, but the first question I had for people was:
in deep learning, why is there this emphasis on depth? Why do people emphasize deep architectures, the multiple layers, versus shallow ones? I think we just talked about it. Does anyone want to summarize? Basically, it introduces an inductive bias. Yeah, exactly, because depth adds this inductive bias, and it's also a very empirical thing:
if you train deep networks, well, there's a point where it doesn't help anymore, but initially, making the model deeper helps you in terms of performance. At some point, if it's too deep, it doesn't really help anymore. And we're going to talk maybe today about why: the issue is,
as you make the model deeper, you make optimization harder. So at some point, if you make it too deep, you just can't train it and nothing works anymore. The second question is there to make sure that everyone understands what a universal function approximator is. So we know that neural networks, deep
networks, are universal function approximators, so I wrote here a little piece of code: can a neural network learn to imitate this code? Exactly, yes. So they're just approximating functions, and this is not a function, because it has an internal state: every time you call it, the output is different, even if you call it with the same input.
So there are architectures that can do this: recurrent networks and so forth, and this kind of moves towards Turing machines. Sorry, yeah, a question about RNNs: don't they run for a specific number of steps? So RNNs, in principle, you can run for an infinite number of steps, so you can think of them as a dynamical system.
In practice, we usually have a fixed length, which is given by the sequence length. Was that the question, or something else? Yes, that's my question: this function can theoretically run for an infinite number of steps, but RNNs can't reach this because of the training limitations, so they can only approximate it.
No, the only thing stopping you is how you train them. Oh, okay. So your question is: in order to learn this function, do you need an infinitely long sequence to see the full state, to be sure? In general, yes, that is true. But okay, for this particular example,
my bet would be that if you give it sequences of, say, length 10, it will already figure out what the underlying function is, because it's kind of simple, and the inductive bias of the RNN, which kind of looks like this, with some weights connecting these two things,
is very aligned with it, so it will discover it. But in general, if you have any kind of dynamical system, unless you observe the behavior of the dynamical system over an infinitely long sequence, there is always some uncertainty. The dynamical system can always behave well
up to some step, then, if the step count is larger than three trillion, do something completely different, and you obviously never observed that, so you don't know that it has this switching behavior. Okay. This is another question, one someone asked me yesterday. I said yesterday, when we were talking about linear regions, that this is a bit like origami, but it's not exactly origami.
Do you know why? Why is that the case? Because origami doesn't collapse things to a single point; it just folds. Yes, exactly: remember I said this pathology, that these models can collapse an entire part of the space to a single point when they fold, and paper can't do that, which is where the analogy breaks.
The second question is: what can go wrong with saliency maps? I tried to explain this yesterday; I don't know if I managed, but I want to see if people understood how saliency maps could go wrong. It's a kind of sensitivity analysis, gradient-based.
Yes.
In front.
Yes, so it is. The issue is that this sensitivity analysis is just a first-order approximation. So, because of the higher-order terms, it might be misleading. I can draw different examples than the ones I was drawing yesterday.
Can you imagine a function that looks a bit like this? Right, and you are here. If you compute the sensitivity with respect to this x, because here it looks quite flat, you'd say: okay, x doesn't matter; I can move it as much as I want,
the function is not going to change. But as soon as you move a little bit, things start changing, right? And this is because of the third-order, fourth-order terms and whatnot: the curvature changes immediately as you start moving. You can make it even worse;
you can make it like a bump. So, in a scenario like this, you would believe that the output does not depend on x, but it actually depends quite heavily on x once you start perturbing the input. So another trick people sometimes use when they do this kind of analysis, to try to understand
or to interpret the decision of the model, is to use perturbations instead of computing gradients. The gradient is the analytical form, and it breaks down in scenarios like this. If instead I look at f(x) and f(x + epsilon),
where I sample a bunch of epsilons, this is a bit more robust. Because what this does is basically approximate the function with something a bit smoother, and therefore you're going to capture some of the change.
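A minimal sketch of this perturbation-based alternative (assumed setup: `model` is any callable from an input tensor to an output tensor):

```python
import torch

def perturbation_sensitivity(model, x, eps=0.1, n_samples=32):
    base = model(x)
    diffs = []
    for _ in range(n_samples):
        noise = eps * torch.randn_like(x)
        # |f(x + noise) - f(x)| probes a smoothed version of the function,
        # so it captures changes a single gradient at x would miss.
        diffs.append((model(x + noise) - base).abs())
    return torch.stack(diffs).mean(dim=0)
```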
So that's one trick you can use if you want to be more confident in your conclusions about whether a variable is important or not. Yeah, I think we talked about this yesterday, so we can just skip this one. My last question: we talked about visualizing filters.
I didn't talk that much about this, and I don't know if people actually followed, but when you visualize a filter, and let's just look at the first layer, because at the other layers it becomes more complicated, I'm assuming that each unit is a
feature detector. So what does it do? It takes the input x and computes a cosine-like similarity with a column of the weight matrix. That column is your template. Then, if the match is high, the activation function fires; if it's low, it means it's not a match, and with a ReLU, for example, the activation goes to zero.
So looking at these columns of the weights, the templates you're comparing against, is a way of understanding what the unit does, right? You visualize this template, it looks like some object, and you say: okay, I'm detecting that object. So, similar to the sensitivity analysis: can you construct a case, or how can this go wrong?
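A minimal sketch of this first-layer visualization (assumptions: a conv weight of shape [out_channels, 3, k, k], at most 16 filters shown):

```python
import torch
import matplotlib.pyplot as plt

def show_first_layer_filters(conv_weight):
    w = conv_weight.detach().cpu()
    for i in range(min(16, w.shape[0])):
        f = w[i]
        f = (f - f.min()) / (f.max() - f.min() + 1e-8)  # rescale to [0, 1]
        plt.subplot(4, 4, i + 1)
        plt.imshow(f.permute(1, 2, 0))  # channels-last for imshow
        plt.axis("off")
    plt.show()
```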
Like, when can you look at this filter and say, oh, this is supposed to detect the picture of a cat, when it actually does not do that? Does anyone have an idea of how you can break this? I mean, the whole point here is that you can break any kind of analysis of this sort.
It's not going to happen much in practice, and the saliency map failure is not going to happen much in practice either, but you know that technically, mathematically, you can do it. How can you mathematically break this one? Does anyone have any idea?
What are you not looking at when you're just looking at the weights? Exactly. Sorry, what did he say? So, one thing you can do, for example, is have a bias of minus a thousand, so that the unit is always zero. You can look at the weights and argue: oh, this is detecting,
I don't know, this particular pattern, and therefore it's going to be used higher up when the network does things. But if the bias is very negative, and you have a ReLU, then basically that unit is never active, and you can't tell that just from the weights.
Because you start at the bottom going upwards when you do this kind of visualization, you can check what this unit is computing, but you never check whether the rest of the network is actually using that unit. So if you start making a hypothesis that the network is using this unit to do X,
that might be wrong: the unit might be checking whether X is in the image, but that doesn't mean the rest of the network is going to use it for anything in its prediction. And can this happen in practice? Usually it shouldn't,
but it can happen in practice if you mess with the optimization process the wrong way. For example, you have a strong L2 regularization that you apply on the hidden units or something like this, and the model decides that the easiest way to minimize that regularizer is to make the biases very negative, or something like that.
It could happen, but it's unlikely if you have a standard pipeline where everything is nice. Question: I don't understand the emphasis on the first layer. Oh, it's just because it's conceptually a bit easier. When you go to the next layers with this kind of filter visualization, usually what you do is take the weights,
then take the inverse of the activation function. Usually activation functions don't have an inverse, so you take something you treat as an approximate inverse; for ReLU, the "inverse" people typically use is ReLU itself, which makes no sense, because it's not an inverse, but anyway. You take the weights, apply that, and then multiply with the weights of the layer below, to compose.
So you can collapse this whole thing to a linear model and then visualize what that looks like. I used the first layer because I didn't want to talk about this inverse, which I think is wrong. But this is what people used to do. Saliency maps are the most standard thing people know nowadays, but in the early 2000s
this is what almost everyone was doing: looking at these filters, trying to approximate the activation function with its expected behavior, or some inverse, or something like that, to reduce the network to a linear model and collapse it
into, again, a linear map, and then look at the weights and visualize them. Okay. So, today: yesterday we talked a little bit about representations and how you parameterize a model, and today we want to talk about learning and optimization.
Maybe... so, this is a slide I've presented in the past. It's not meant to be mathematically precise; it's meant to be illustrative of what's going on, and I'm just going to try to explain it in words, because maybe words are better.
The whole point I'm trying to make here is that there is a distinction between optimization on one side and learning and generalization on the other. ML mostly cares about generalization and learning, but optimization is the part you can formalize, the part for which there is a lot of theory.
So what is the difference between the two? Ideally, when you have some loss, which is just a distance between the output of your model and some label that you have, what you care about is minimizing this loss, this distance, over the entire domain, right?
Like, if you're doing this for images, you care about the entire domain of images, and this is what learning and generalization are about: making sure that this function h matches the underlying process that generates the label y.
So, basically, you want to have good predictions for all possible images. But computing this integral is intractable. We don't have the data for it; you don't have all possible images that ever existed. So what we can do instead is approximate this integral by a sum.
What we have is a bunch of samples, and what we can do is minimize the distance between the model output and the labels on this data set that we have, right? But this is different from what we care about.
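In symbols, a rough sketch of the distinction, with assumed notation (h_theta is the model, p(x) the data distribution):

```latex
% What we want (generalization) vs. what we can compute (training loss):
\min_{\theta} \int L\big(h_{\theta}(x),\, y(x)\big)\, p(x)\, dx
\;\;\approx\;\;
\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L\big(h_{\theta}(x_i),\, y_i\big)
```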
We already have the labels for these data points, so the fact that we get a function that gives the correct label for a data point in the training set is not useful by itself; I already had the label. The only thing I care about is how well this function responds to data that I do not have, right?
So I care about the integral, the generalization part, but the only thing I can work with is the optimization part. We will talk a little about how optimization can help you reach your goal in terms of generalizing, but the first part of what I want to present today focuses on optimization.
So for now: the integral is what we really care about, but let's ask the question, if I have some function h parameterized by theta, and I have a bunch of data points (x_i, y_i), how do I find the theta such that this distance is minimized on the training set?
So if I have some arbitrary loss, the loss computed over the entire training set, and it looks like this, how would you go about finding the minima of this function? Anyone have an idea? Gradient descent? Yes. So how does gradient descent actually work?
The first thing you see is that this looks very ugly, and there is no way to have some kind of closed-form solution for it; there's nothing there to work with. So what you have to do is some kind of iterative search to find your minimum.
So how does gradient descent work? I'm again going to mostly ignore the math, but I think the intuition here is quite useful. When you do gradient descent, you do the following. You say: I have no idea how to optimize this function, but I know how to optimize linear functions, or even quadratic functions. For those,
I have closed-form solutions; I know how they work. And I know that any function can be approximated by a Taylor expansion, right, and the Taylor expansion goes from linear to quadratic to third order, and so forth. So what I'm going to do is pick some point,
the random point where I am, and say: let me approximate the function I'm trying to minimize by a linear function. So the red dotted line is the linear function approximating mine. And now, the nice thing is, I know where the minimum of this linear function is,
because for a linear function, I know how to minimize it. The issue is that the minimum is somewhere at infinity, either plus or minus infinity depending on the slope, so it's not that informative: it's probably not the minimum of the black line, the initial function I wanted to minimize. And the reason it's not matching is that my Taylor expansion is only true locally.
So then what I'm going to say is: I'm going to construct a trust region. I'm going to say, okay, I can only look within this circle, and then I'm going to ask myself: what is the minimum of this red line, which I know how to minimize, within the region where I think this approximation is actually reliable?
And you can write this down, and it turns out to be a constrained optimization that looks something like this: minimize the first-order Taylor expansion, my linearization of the function, so minimize L(theta) plus delta times the gradient,
such that delta, the step I'm minimizing over, is small enough that the approximation is actually reliable. And the way you solve this constrained optimization is you use a Lagrange multiplier to take the constraint and push it back into the equation. As I said, the math is maybe not that important, and probably many of you know how Lagrange multipliers work, but anyway:
you use Lagrange multipliers to move the constraint up into the objective, but you don't solve for the coefficient properly; you treat it as a hyperparameter. So what you get when you push this in is that the update is actually just gradient descent.
Because when you push the constraint in, it becomes the second-order term, so the objective becomes a quadratic with the identity as the coefficient of the second-order term. When you try to solve that, you basically just get gradient descent: you take this formula,
take the gradient of it, check where that gradient is zero, and what you get out of it is your usual update, where you have the learning rate, which comes from this Lagrange multiplier. Does that make sense, in terms of how this works?
Intuitively, what you're doing with this iterative method, and this is true of any kind of iterative optimization, is: you start somewhere, you make a local approximation of your function by something simpler that you can solve, and you restrict yourself to the area where this approximation holds.
Then you solve the approximation, rather than your original problem, within that area; you take that step; you get there; and then you repeat the process and keep going like this. This is the idea behind most of these iterative searches.
So here is the math spelled out a little more. You had this constraint term, which you can write as eta times the norm of delta squared. I'm being very loose with proper matrix notation and such,
just to make it look a little simpler. So how do you solve this? Once I put my constraint into the main equation, I get this equation, a second-order equation. How do I solve it? I take the gradient of it and set it to zero,
and I basically get the update I was looking for, which is the standard gradient descent update that everyone knows, where at every step I move in the opposite direction of the gradient, with the step size scaling that.
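A compact sketch of the derivation just described, in assumed notation:

```latex
% Linearize around theta and restrict to a trust region of radius epsilon:
\min_{\delta}\; L(\theta) + \nabla L(\theta)^{\top}\delta
\quad \text{s.t.} \quad \|\delta\|^{2} \le \epsilon
% Push the constraint in with a Lagrange multiplier eta:
\min_{\delta}\; L(\theta) + \nabla L(\theta)^{\top}\delta + \eta\,\|\delta\|^{2}
% Set the gradient with respect to delta to zero:
\nabla L(\theta) + 2\eta\,\delta = 0
\;\Rightarrow\;
\delta = -\tfrac{1}{2\eta}\,\nabla L(\theta) = -\gamma\,\nabla L(\theta)
```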
Are there any questions before I jump to the next part?
Yeah: so you said we took the constraint and put it back into the equation, and the constraint is that the norm of delta is actually less than epsilon. So how did we actually reflect that in the first equation? Right. So,
what the multiplier theory says is that the solution of this constrained problem is equivalent to the solution of this formula, where you just add the constraint weighted by a coefficient. This is the Lagrange coefficient, this eta that you end up with here. So this norm of delta is this term here, and this coefficient is the one that comes from the multiplier.
So this is how you solve constrained optimization. I think this form is for equality constraints, and if you have inequalities there is the extended form; I don't remember its particular name, but in the end you end up with the same thing:
you basically move your constraints into the equation, weighted by coefficients. And typically, if you want to solve this properly, there is a way of optimizing for this coefficient: you need to solve for the coefficient and then plug the right coefficient in here.
And then, when you solve that, you get the exact, proper solution. What we do in machine learning is different: we don't care that much, and we don't have the budget to solve this properly. So what we say is that we're going to replace this eta by some constant.
Because, in some sense, we don't even know what epsilon should be. Let's put it this way: we don't know how big this area needs to be such that my first-order approximation still holds. We just know it has to be a small step. I mean, you can compute it;
this is what adding curvature and all of that into the equation is about, trying to get a sense of how quickly my loss surface changes, to get a sense of the area where I can trust my approximation. But for now we're not talking about that.
So we just pick an epsilon somewhat arbitrarily, because we don't care about what epsilon is exactly, except that it's very small. And likewise, we don't care about solving for the Lagrange multiplier. We just take this form and replace this eta, which further on
is going to become gamma, because of the one over two eta: we're going to replace it by the learning rate, and then we hyper-tune the learning rate. All I know about the learning rate is that the steps should not be too large, because if they're too large, then basically you're not respecting this trust region, and you're
moving outside of where the approximation holds. So basically we replace that by the norm? Yeah. So this constraint becomes just... yeah, I should have put a square here, because you don't want to deal with square roots and stuff like that. Sorry; typically,
when we say norms here, we always mean norm squared, so that we don't need to worry about the square root. It actually does not matter which norm you pick. You can use whatever norm, in some sense; as long as you're restricting your step size, that's all that matters.
Okay. So, in order to be able to apply gradient descent, to do all of this, what we need is the gradient of the loss with respect to the parameters. And the way to compute this is the backprop algorithm, right, which probably everyone knows. The way the back-propagation algorithm works:
you start at the loss, at the top, and you apply the chain rule, right? You compute the gradient of the loss with respect to the output, then multiply this with the Jacobian of the output with respect to the previous layer, and so forth. At some point, I forget when, there used to be this discussion on Twitter where Jürgen Schmidhuber was trying to claim that backprop is just the chain rule,
and the chain rule is from the 1700s. So I guess one question is: is this just the chain rule, or is there anything special about backprop? Because there is a proper paper, right, from the 80s: I think it's the paper from Jeff Hinton and others that introduced the back-propagation algorithm.
There is then, afterwards, a paper about the back-propagation-through-time algorithm, and these are the kinds of papers with thousands and thousands of citations, and so forth. So is this just a chain rule that we've known for hundreds of years, or is there anything special about it? What do people think?
I think there's a trick.
Yeah, so the thing about back propagation, just to defend the paper and Jeff and the others: there is a trick here. Backpropagation is about engineering. The idea in backpropagation is that it gives you a particular way of computing the gradients that leads to minimal flops and memory usage.
And basically the trick, which for a simple MLP is kind of easy to understand, is that the loss is always a scalar. So when I'm computing the gradient of the loss with respect to the output, that is a vector. And from then onward, I have vector-matrix multiplications, each of which results in a vector, so the state that I have to carry as I go back through the architecture is always a vector.
If I start from the bottom applying the chain rule... the chain rule does not impose an ordering; the chain rule just says, well, I can start in the middle and go anywhere I want. But if I do that, I have to keep around full Jacobian matrices, which are actually quite large. And
the compute you have to spend when you do a matrix-matrix multiplication, and the memory that you have to keep, are much, much higher than if you do a vector-matrix product. So this is basically the main thing about back propagation: it is not something other than the chain rule;
I mean, there's nothing else you can do, you're trying to compute the gradients. But it is a way of arranging your computation to make it quite efficient. And when you have... what's happening? Yeah, about that:
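A minimal sketch (assumed sizes) of why this ordering is cheap when the loss is a scalar: the backward state stays a vector, so every step is a vector-matrix product rather than a matrix-matrix product.

```python
import numpy as np

jacobians = [np.random.randn(512, 512) for _ in range(10)]  # one per layer

g = np.random.randn(512)     # dLoss/dOutput is a vector (the loss is scalar)
for J in reversed(jacobians):
    g = g @ J                # vector-matrix product, O(n^2) per layer

# Starting from the input instead would chain full Jacobians:
# J_total = J10 @ ... @ J1, matrix-matrix products at O(n^3) per layer.
```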
Okay, so: in machine learning, it's all about how you make these things run efficiently. I mean, that's what made machine learning work. So that is the secret. And when you have more complicated computational graphs, like, for example, skip connections, recurrences,
all this kind of stuff, it becomes even more apparent that there is a particular strategy for how you're supposed to compute the gradient, how you're supposed to go through the graph, that will lead to minimal memory and flops. I have a question. Yes? You said this paper was decades ago.
Why has no one introduced any substantial improvement over back propagation since then? Is it because back propagation is already complete and there is no room for improvement, so we just take it for granted? Yeah, so, okay, first let me ask a question back:
do you mean improvements in how you compute the gradients, or improvements in hardware and optimization, kernel-wise? In how you compute the gradient, not anything related to the hardware. So, in terms of how you compute that gradient, there aren't many other options. In some sense, if your goal is memory and compute efficiency, just to compute the gradient:
I mean, for optimization, there are other algorithms instead of gradient descent; you can do other things. But for the question of how you compute the gradient: if the goal is minimal compute and minimal memory, backprop is the optimal thing you can do. There is no other ordering of your gradient computation that would give you something better.
There is another thing that people do, for various reasons. I don't think I have the slide for it, but there is what people call forward mode, which is: you start from the input and go all the way to the top when you compute a gradient. And this is useful
in different scenarios. I'll give you the standard scenario, though it's a bit out there: it's for recurrent neural networks. The point for recurrent networks is that when you do backprop, you can imagine that you have to go forward in time all the way to the end,
and then you start at the end computing gradients. This is biologically not plausible, and it also makes things quite slow, because if you have a very long sequence, you have to wait until the end of the sequence. What's nice about starting from the input, for recurrent networks, is that you can compute the gradient and the forward pass at the same time: you just go forward in time, and that's it.
The downside is that the memory consumption is high: n cubed instead of n or n squared, something like that; anyway, the memory consumption is pretty high. And people don't actually use this much, but it is more biologically plausible, because you're computing both signals at the same time. I mean, it's not fully biologically plausible, because gradients are not neural activities, but at least it doesn't require any movement backwards in time;
it's always forward. And it can be useful in all kinds of niche settings. So, this is called real-time recurrent learning, RTRL. It was big in the 90s, it went out of fashion, and now it's been revived in some niche papers, but it's not mainstream.
And the reason it's not mainstream is that the memory consumption is high, and backprop through time typically still works the best. With the hardware that we have, with everything, it just works the best. But yeah, I mean, I don't think there is much room to improve how you compute the gradient.
There is room in what you do with the gradient: instead of plain gradient descent, doing fancier stuff on top of it. Sorry, about you saying that you just do a forward pass: you mean that you do the forward pass and the backward pass at the same time?
Yeah, because you start at the bottom, right? So even for feed-forward models, you don't need to go all the way up and then do your backwards pass. If you start from the beginning, as you compute h, at the same time you can compute the Jacobian of h with respect to the parameters,
and then you keep these going forward together, right? You keep accumulating: whenever you do a forward step, you compute both the forward activation and the Jacobian. And for most cases, the Jacobian is in closed form, right? It's a function of x and W, so you have it right there.
That makes sense, right? So you can compute the Jacobians as you go forward, and you can multiply them as you go as well. It's just that you always get matrix-matrix multiplications between the Jacobians, and the memory footprint is much higher. So the issue is: computationally heavy, memory heavy, but you don't need to go up and down;
you just go up, and that's it. Yes, this is the trade-off. Yeah, so there are reasons why you would want to do this, as I said, because you don't want to go backwards. But usually, if compute and memory are your bottleneck, then backprop is the optimal thing, you know.
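A minimal sketch of this forward-mode accumulation on a toy linear recurrence (assumed setup: h_{t+1} = W h_t, tracking the Jacobian of the state with respect to the initial state as we go):

```python
import numpy as np

n = 64
W = 0.1 * np.random.randn(n, n)
h = np.random.randn(n)
J = np.eye(n)                # dh_0/dh_0

for t in range(100):         # forward in time only, no backward sweep
    h = W @ h                # the state
    J = W @ J                # the Jacobian dh_t/dh_0, carried alongside
# Each step needs a matrix-matrix product and stores a full matrix:
# that is exactly the compute and memory overhead of forward mode.
```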
Oh, yeah, sorry, I did not realize I have this slide, but this is exactly the computational graph of the backwards mode, so this is backprop, back propagation. It should start with x, but here it starts with h:
you have h1, you compute h2, you compute h3, and you compute the loss. And then, from these, you can compute the derivative of the loss with respect to h3, and from this you can compute the derivative of the loss with respect to h2 by just applying the Jacobian: you compute the Jacobian and then you multiply by it, and this is the computational graph that you get for backprop.
Here is the forward mode that I was discussing, where you compute h1, and then from h1 you compute h2, so there should be an arrow here as well, and in parallel you can compute the Jacobian; and from then onward, you just keep going and compute them in parallel.
So these are the two most popular ways of computing gradients, backward and forward mode. And, yeah, I mean, we're so used to having these libraries... sorry, okay, I'll try to get to a break in five minutes;
I'm just going to try to finish this part. So we're so used to using something like PyTorch or JAX or whatnot for these things. All of these libraries work exactly like this, so then you have a sense of what's going on.
Usually, when you build an architecture, what all of these libraries build under the hood is a computational graph that tells you how the computations go. And then, when you call grad, they just take the computational graph and construct the computational graph of the gradient, and that usually is done by starting from the end and inverting the arrows. The way PyTorch works, specifically, is that for every layer,
it basically has the Jacobian of the layer defined in closed form. So when it needs to do the backwards pass, it just goes through the computational graph, and for each layer it knows what the Jacobian is, computes it, and constructs the graph that way.
So usually, the main data structure behind all of these libraries is some kind of graph, which is the computational graph, and most of the automatic differentiation, and the other things that you do, are all just manipulations of this graph based on specific rules.
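A minimal sketch of this in PyTorch: the operations below are recorded as a computational graph, and grad walks that graph backwards.

```python
import torch

x = torch.randn(4, requires_grad=True)
W = torch.randn(3, 4, requires_grad=True)

h = torch.tanh(W @ x)   # recorded graph: x, W -> matmul -> tanh
loss = h.sum()          # recorded graph: tanh -> sum (a scalar)

# Walks the recorded graph from the loss backwards, applying each op's
# Jacobian, and returns the gradients with respect to the inputs asked for.
g_x, g_W = torch.autograd.grad(loss, (x, W))
```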
So this is how these things work. And then the hardest job with this stuff, which I think new libraries nowadays don't even attempt to do as much, is to try to optimize the graph. Like, I remember in the early days, when I was working on this: I spent part of my PhD working on Theano, which was one of the first libraries to build these things.
One thing we were worried about is: you get a graph like this, or more complicated, and then you want to check whether the graph is computationally optimal, whether you could build a different graph that has fewer nodes or fewer edges, because somehow you know you can reuse some of the computation.
And most of the time, the gradient graphs you get are not computationally efficient; they waste compute. So there is a lot of room to optimize these graphs, but doing that automatically for the user is actually really hard. Theano tried to do this, and we ended up... so we had one of these ops to do recurrences,
the op scan, and it ended up that when you had a program containing scan, you had to wait like 15 to 20 minutes for the code to compile, so to say, before you could actually run it, because it was trying to optimize the graph, and most of the time
it wasn't able to do anything useful. Nowadays, I think, what people do instead is: when they identify some inefficiency in how the computations happen, they build a kernel, a specialized layer for that particular thing. And I think this is a much more scalable way of dealing with it.
But yeah, in principle, this is the main thing behind these libraries: building the computational graph, using it to generate your gradients and other kinds of derivatives, or other things you might care about, like computing Hessians; it's the same, you can use this graph to build the graph that computes the Hessian; and then doing manipulations on this computational graph to make it faster.
This is kind of the bread and butter for things like PyTorch and JAX and so forth. Okay, let's take a short... oh, there's a question, and then we can take a short break. When you're doing it, I think there is...
So, what we talked about is gradient descent, which is an iterative process where you linearize the system, you take a first-order Taylor expansion of your model, and then you restrict yourself to some kind of trust region where you assume this linearization holds. What this gives you is a constrained optimization where you're trying to minimize the simpler function,
the approximation that you've made, within a constraint on how much you can move. You typically use Lagrange multipliers to deal with the constrained problem, to transform it into an unconstrained optimization problem by moving the constraint into the equation. This becomes a second-order equation that you can solve in closed form, and what you get out of it
is your typical gradient descent step, right, which says: move in the opposite direction of the gradient. That is the intuition of what's going on: you're solving a problem you don't know how to solve by making many local approximations and solving those local approximations to move forward, throughout all of this process.
In order to do this, you need to have the gradient, and the way we compute the gradient, because neural networks are just a big composition of functions, is
the chain rule, arranged in order to minimize memory and compute. And that is what the back-propagation algorithm is all about. And in the autodiff world, maybe it's worth mentioning: so, again, in some sense, for back propagation there is a paper, and everyone celebrated it and whatnot, and that is how people in machine learning refer to the algorithm. In the autodiff world, which is a different community, a different field,
they would call this backward mode, and the other one would be forward mode. And I think this probably existed before back propagation; people in machine learning just did not know it, and that's when they realized they needed this machinery and rediscovered it. If you look across fields, things get reinvented all the time,
and in different ways. So this is backpropagation, and on the next slide, this is an exercise, but it's something we're going to work through. Notice in this exercise, you can see what happens when you do backwards mode when you have a more complicated computational graph. So you have this DAG where, from h1,
you compute two different embeddings, h2 and h3, and both of them feed into the loss. And to compute the gradient, you can take the brute-force approach, which is: you enumerate all the different pathways that go from the loss to the input. In this particular case, you have two pathways, one that goes through h2, the other through h3.
You compute each path independently, and then you add the terms. This is what the exercise calls brute force. It is not what the back-propagation algorithm would do. This brute force, I mean, it doesn't look like it here, but it is inefficient: if you have a much more complicated DAG with a lot more branching, just enumerating all possible paths and doing each path individually leads to a lot of wasted compute.
What backprop would do is basically this algorithm: when pathways have a common node, you collapse them together. So you start both pathways in parallel, you compute the gradient of the loss with respect to h2 and the gradient of the loss with respect to h3, and then at this step, because these two pathways go into the same node,
you sum them, and you continue from there. So, in this particular case, this only saves you one extra Jacobian multiplication; if you look at the computations you have to do, you save one matrix multiplication, which doesn't seem like much. But when you have a complicated DAG, if you do this
summation as soon as you can, whenever a node has multiple edges going out of it, you're going to end up saving a lot of compute and memory. And then, fundamentally, that is the backprop algorithm; it's usually framed as some kind of dynamic programming algorithm, which gives you the most optimal way of decomposing things.
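In symbols, a sketch of the two-path DAG in assumed notation (h2 and h3 both computed from h1, both feeding the loss L):

```latex
% Either way, the answer is the sum over paths:
\frac{\partial L}{\partial h_1}
= \frac{\partial h_2}{\partial h_1}^{\!\top}\frac{\partial L}{\partial h_2}
+ \frac{\partial h_3}{\partial h_1}^{\!\top}\frac{\partial L}{\partial h_3}
% Brute force evaluates each full path separately; backprop computes
% dL/dh_2 and dL/dh_3 once and sums where the paths merge, reusing work.
```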
So, yeah, this is the extent to which I'm going to talk about back propagation, but maybe in the homework... oh, this reminds me: I think today we will try to send the first homework out for people to start looking at it. It's going to be due next week on Wednesday, so there should be plenty of time.
But some exercises are a bit more open-ended, so I thought the more time you have the better, and we'll just see how that goes. Okay, so, back from the break. We're still on the back-propagation algorithm. This is a typical neural network, and this is its gradient. If I take a simple MLP... I also realize that I never mentioned these things are called MLPs;
I think everyone knows they're called MLPs. But if I take a simple MLP, and MLP stands for multi-layer perceptron, which is your standard feed-forward neural network, and I write down the chain rule, written in the way backprop would do it, from the loss to the beginning,
I get a series of Jacobians. When you're doing backprop on an MLP, you basically have an alternation between two types of Jacobians. There are the Jacobians that correspond to a linear projection, and the Jacobian of that is just the weight matrix transposed; there should be a transpose there, but anyway, there is a transpose. And then you have the Jacobians of the activation functions. Those are typically diagonal, because the activation function is an element-wise operation. Since the Jacobian itself is diagonal, when you multiply with these things you don't actually need to do a matrix-vector multiplication; you do an element-wise vector multiplication, which is faster. But the point I wanted to make is that you end up with a lot of Jacobians that are Jacobians of the activation function, so your choice of activation function matters. We haven't talked that much about that.
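A rough sketch of that alternation (a made-up two-layer ReLU MLP, not the network on the slide): the backward pass alternates a multiply by the weight transpose with an element-wise multiply by the activation derivative.

```python
import numpy as np

def relu(x):       return np.maximum(x, 0.0)
def relu_deriv(x): return (x > 0).astype(x.dtype)

# Forward pass through a tiny two-layer MLP (shapes are arbitrary).
W1, W2 = np.random.randn(8, 4), np.random.randn(3, 8)
x = np.random.randn(4)
a1 = W1 @ x; h1 = relu(a1)
a2 = W2 @ h1; out = relu(a2)

# Backward pass, starting from some upstream gradient dL/dout.
g = np.random.randn(3)
g = g * relu_deriv(a2)   # diagonal Jacobian of the activation: element-wise product
g = W2.T @ g             # Jacobian of the linear layer: weight transpose
g = g * relu_deriv(a1)
dL_dx = W1.T @ g
```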
We talked about ReLU from the point of view of expressivity and what you can do with it, but I haven't talked much about how you end up picking an activation function and what you're looking for. So, this is the sigmoid activation function, which used to be the default for neural networks, and the ReLU, the one we've discussed. In blue you have the function itself, and with the dotted yellow lines you have the Jacobian, the derivative of the function, so you can see how it looks. One thing you notice, for example for the sigmoid, is that the derivative is actually quite small. So when you have a very deep model and you have to multiply many of these Jacobians together, you can see that they can only do one thing, which is shrink the norm of your signal. You can bound how fast it shrinks the norm of your signal and get a sense of how quickly your signal will vanish, but overall,
that is the reason why sigmoids are not in fashion anymore: when you look at the backward pass, their properties are not very nice, because they shrink the norm of your gradients very fast. And basically, once your gradients disappear as you go through the chain, you stop learning. If you don't have a gradient that connects your loss to the input, you cannot find out what the relationship between your input and the loss is, so you cannot really learn anything, right?
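A quick back-of-the-envelope illustration (my own numbers, not from the slide): the sigmoid's derivative is at most 0.25, so the product of these derivatives through a deep stack shrinks geometrically.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sigmoid_deriv = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # peaks at 0.25 when x = 0

depth = 50
pre_activations = np.random.randn(depth)          # one made-up pre-activation per layer
factor = np.prod(sigmoid_deriv(pre_activations))  # scale applied to the backward signal
print(factor)                                     # something like 1e-35, effectively zero
```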
You cannot learn what y is given x. On the other hand, ReLU has its own pathology: while the unit is active, the gradient is one, so there is no vanishing of the signal, but if the unit is not active, the gradient vanishes immediately. Initially, people were worried about ReLU because it has this region of zeros, so the concern was: if I have a model and I initialize it, what happens if, in a given layer, all my hidden units are dead, all in the zero region? Then there is no signal flowing through, there is no learning happening, and my model is stuck. That is maybe how people were thinking about this in the early days. What happens in practice is that this is a very unlikely event; using standard initializations and so forth, it almost never happens. And in general, if you want to approximate what ReLU is doing to the backward signal, you might say that ReLU multiplies the signal by roughly 0.5, taking an expectation: half of the time the unit is zero, half of the time it lets the signal through. So it scales the signal down; it's not exactly 0.5, but anyway, it scales the signal down much less aggressively than sigmoid, and that is what matters here. Question, yes: when does it happen then, when does it become zero? The ReLU is zero when the input is negative.
So in practice, when does that happen? Because you said it almost never happens. Oh, um. Actually, a unit ends up in both states quite often. There are two states for a unit that are very problematic when you have a ReLU. One is when the unit is always zero: then the unit is dead, and you're basically losing capacity even though you have those weights; it's not participating in the model at all. The other problematic state is when the unit is always on. If you have a layer and a unit of that layer is always on, then basically you don't have an activation function there: you have two linear projections one after the other, so again you're losing capacity. The optimal use of your activation function is when the unit is off about 50 percent of the time and active 50 percent of the time, because that's when you actually get non-linear behavior from that unit. So you don't want to be at either of these extremes.
There was another question, yes: can we just use the leaky ReLU to be safe from the zero problem? Yeah, definitely, you can. And people nowadays don't really use plain ReLU anymore, right? We have GELU and all of these variants, and they're all very similar. Some of what they do is soften this region around zero. But yeah, you can definitely do that; I'm just saying that in practice ReLU is not that big of a problem either, so you can use it safely. Yes, a variant of the rectified linear unit? Yeah, and we also use those variants to get around these issues.
There is Swish, there is GELU, there is leaky ReLU, and then there is a bunch more; at some point it was very popular to come up with new activation functions, and that kind of died out. There is an argument that I'm going to make a bit later about GeGLU, or rather, they call it an activation function, but I wouldn't quite call it one. It's more like an activation function with parameters: you branch out, you have two different projections, and then you multiply them together, where on one side you have the GELU activation function and the other side is linear. In Transformers it's now being replaced again by a different thing, but it used to be the default, and it used to work way better than all the others. But it worked better because of this gating behavior, this multiplication behavior, not because of the activation function itself, which in the end was very similar to all the others.
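A minimal sketch of that gated structure (my own toy module and dimensions, not code from the lecture); two projections are branched and multiplied, with GELU on one branch only:

```python
import torch
import torch.nn as nn

class GeGLU(nn.Module):
    """Gated unit: GELU(x W_a) * (x W_b). The multiplicative gate, not the
    particular non-linearity, is what the lecture credits for the gains."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.proj_a = nn.Linear(d_in, d_hidden)   # branch passed through GELU
        self.proj_b = nn.Linear(d_in, d_hidden)   # purely linear branch

    def forward(self, x):
        return torch.nn.functional.gelu(self.proj_a(x)) * self.proj_b(x)

x = torch.randn(2, 16)
print(GeGLU(16, 64)(x).shape)   # torch.Size([2, 64])
```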
The non-linearity used there was GELU, which is a smoothed version of ReLU, so that part doesn't matter much. In general, most activation functions are a variation on this shape. I think there is still an open question of what an optimal activation function is, and I think there is still room to come up with new activation functions. But the way I would approach it, if I had to do research on this topic, is this: I think there is enough evidence that, in domain, all activation functions at some point start behaving the same, particularly if you have sufficient scale and you optimize your model properly. So if you use not a super fancy optimizer, but a decent optimizer that can correct for some issues with the curvature, like Adam, or AdamW, or some version of that, and you allow yourself to scale as much as you need, you will not see any difference between activation functions. But another way to think about activation functions is as part of the architecture: they define the inductive bias of the architecture. The fact that you use ReLU or something else might bias you towards certain kinds of solutions. We don't necessarily know how to describe this mathematically, but it is definitely part of your inductive bias. So if I were doing architecture search to find new activation functions or anything like that, I would look for activation functions that behave differently out of domain. It's a bit of a lost cause at the moment to try to come up with an activation function that does better in domain than the ones that exist.
I think they all do roughly the same thing. There is one place where this still matters, which is, from what I understand, for example at Gemini scale, when they train the extra-large models. Because everything is so expensive, if you have an activation function that costs a few fewer FLOPs than another but behaves the same, it still matters to them, because FLOPs matter a lot when you have these huge systems, or if something helps you on the hardware somehow. Then, sure, I would buy that. But if you're a normal researcher who is not thinking about these extreme-scale scenarios, I don't think it matters which activation function you use; any of them is fine. Out of domain, though, I still think the choice of activation function can make a huge difference, because activation functions shape the inductive bias of your model, and I think that is a more reliable space in which to look for new ones. But yeah, for my lectures I'll mostly talk about ReLU because I find it simple; almost anything I say you can replace with leaky ReLU or whatnot, and it's the same story.
Sorry, a question about what I was saying about extreme scale and FLOPs: isn't that about outliers? No, it's about making things cheaper. Some of these activation functions might involve, say, the exp function, and exp is actually not that cheap to compute compared to squaring or multiplication. So if you can get the same effect, the same kind of behavior, with an activation function that uses simpler or fewer atomic operations, that can be a big win. When I say extreme, it's all about scale and compute, right? So anything you can save there can mean a big deal. But even there, I don't think it's that active an area; I think people are pretty conservative in their choice of activation function, because it's also a place where you don't want to take too many chances. You might have a new activation function that looks the same as ReLU, or GeGLU, or whatever your activation of choice is, but you're not 100 percent sure, so it's not worth taking that risk. If you have your pipeline and everything is in place, you just go with what's there. So, the other thing I wanted to say is that when you think about optimization and gradients and all of this, one thing that's important in neural networks is the following. So far in the talk I've been
presenting things from a functional point of view: you have a function, you compute the gradient of that function, and you go with it. But what's really important is to understand how the signal propagates through the network. I think the previous slide is exactly about that: if you have the wrong activation function, then as you go backwards through your network from the loss to the input, the signal might vanish at some point, because of the activation function or something else. To a large extent this is one of the main tools you have to debug architectures: looking at how the signal propagates through the architecture is actually quite useful and quite important. And this was one of the big breakthroughs for deep learning. It comes back to things like the Glorot initialization for neural networks, and then Kaiming He and others who did other variations. All of these are about
signal propagation, and usually the kind of question they ask is: how should I initialize my model, and how should I use my activation function, such that, in expectation, the variance of the signal does not vanish as I go through the architecture? This is usually the way these things are derived: you look at a single layer and say, okay, I have variance one at the input of this layer, and I want, in expectation, to see variance one at the output as well. That is the machinery they use. If the initialization is too small, your gradients vanish; they concentrate around zero as they go through the architecture. If your initialization is too big, your model becomes saturated, and again, when the activation functions are saturated, the signal does not propagate well. When the initialization is just right, if you push a Gaussian through, you keep getting roughly the same Gaussian as you go through the layers. That was the philosophy. You can look at the papers to see the math and how they derive this, but this is the philosophy behind initialization and activation functions.
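A rough numerical check of that variance-preservation idea (my own toy experiment, not the derivation from the papers); the He-style fan-in scaling for ReLU layers is my assumption of what the slide refers to:

```python
import numpy as np

depth, width = 30, 512
x = np.random.randn(4096, width)    # a batch of inputs with variance ~1

for layer in range(depth):
    # He-style fan-in scaling: Var(W_ij) = 2 / fan_in, chosen so that the
    # pre-activation variance stays roughly constant even though ReLU
    # kills half of the signal in expectation.
    W = np.random.randn(width, width) * np.sqrt(2.0 / width)
    h = x @ W
    x = np.maximum(h, 0.0)
    print(layer, round(h.var(), 3))  # stays roughly constant with depth; with
                                     # sqrt(1/width) it would roughly halve per layer
```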
I feel like I'm going through too much math and that's not useful, but maybe the one thing I should say, which is kind of important, is that when they do these kinds of derivations, if you look at the math, they sometimes make very strong assumptions. For example, you assume the tanh is close to a linear function and you get rid of the tanh, because you don't know how to deal with it, or things like that. So basically, all of these initializations are the output of applying this recipe, trying to maintain variance, while making some fairly strong assumptions along the way, because there is some mathematical element that maybe could be dealt with, but it's messy. As usual in machine learning, when things are messy you just say: well, I'm going to skip it, I'm going to assume this is linear, I'm going to assume this is independent from that, even though they're not, and move forward. Then you apply it, it works reasonably well in practice, and you're good to go. So this is the philosophy of initialization. But these initializations, and we're going to come back to this, because of the assumptions you're making and these approximations in the math, and also because of what you're trying to do, which is looking at a single layer and asking the signal to
preserve variance in expectation, are not necessarily doing the right thing. By definition, you're looking at the average-case scenario; you're not looking at the worst-case scenario. So with any of these initializations, if you take an MLP and you make it deep enough, say 100 layers or more, you'll see that the signal always vanishes, because you're not looking at the worst case; when you make the architecture very deep, that is what happens. And the truth is, it's really hard to initialize a network such that the signal does not vanish when you have a very deep architecture.
Yes, the question is whether it's a problem of the initialization itself, whether you can fix the vanishing signal through initialization alone. So, if you really want to solve it by initialization: I think James Martens, who is a big name in optimization, has a paper on this, but I'd have to check, because he's been working on it for several years. I don't know if he has put it on arXiv yet or not; he's just that kind of person who never puts things on arXiv. It's also a paper that's really hard to read, something like 100 pages, so it's not something you're going to read. But if you really want to solve this through initialization, it turns out that, because of the non-linearity, there is no real closed-form solution for how to do it. So what he does, if I remember correctly, is run a layer-wise optimization process to find the right initialization of each layer as you go through the network, which doesn't feel like a very practical thing: if you need to run an optimization process per layer just to initialize the model, before you even start training on the actual task, that feels quite wasteful. So you could technically do it, but because of the non-linearity, the initialization is very data dependent; it's not something you can get in closed form, and it's probably never going to work very well.
Normalization layers are a much more pragmatic way of dealing with the problem, and they are the reason we can train these deep models. Without normalization layers, you will never be able to train something with 100 layers or anything like that. Yes, a question: what was the target when they came up with these initializations? Were they trying to minimize the variance of activations across layers? For the initializations, they were just trying to keep the variance constant in expectation as you go from one layer to another: if you have some input variance, you want the output variance to be the same. And does this ensure we're not going to have any vanishing? It doesn't quite work out in practice once we start stacking things. It ensures that through that one layer, taken independently, you're not going to have vanishing gradients; it doesn't pan out the right way when you start stacking things.
That is the issue: because you're doing something in expectation, asking for the variance not to disappear on average, you're not looking at the worst-case scenario. So, this slide is work from Andrew Saxe. Andrew Saxe likes to look at deep linear models, which are basically just MLPs without activation functions. That sounds silly, because such a model is just a linear model, but the whole point is that you can build theory around it, and sometimes that theory carries over. Here he is looking at what happens to the signal if you have a deep linear model, so you remove the activations so you can do the math. What happens if I stack 100 layers of those, and how do I make sure the signal doesn't vanish? In this particular case there is a closed-form answer, which is that each of your weight matrices should be orthogonal. That takes care of the worst-case scenario: no matter what signal you put in, the matrices can only rotate it; they can never shrink or expand it. So for a deep linear model, if all your weights are like that, you don't have any vanishing gradients; you can have a thousand layers and you still have gradient flowing, and everything is fine.
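A tiny check of that claim (my own toy setup, not Saxe's actual experiment): orthogonal weights keep the norm of the signal exactly, while a naive Gaussian initialization lets it drift over depth.

```python
import numpy as np

width, depth = 256, 1000
x = np.random.randn(width)
x_orth, x_gauss = x.copy(), x.copy()

for _ in range(depth):
    Q, _ = np.linalg.qr(np.random.randn(width, width))   # random orthogonal matrix
    x_orth = Q @ x_orth
    W = np.random.randn(width, width) / np.sqrt(width)   # variance-preserving Gaussian
    x_gauss = W @ x_gauss

print(np.linalg.norm(x_orth) / np.linalg.norm(x))    # exactly 1.0 up to float error
print(np.linalg.norm(x_gauss) / np.linalg.norm(x))   # typically drifts far from 1.0
```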
The issue is that the non-linearities, and in particular all the non-linearities we actually use, are either shrinking the signal, or maintaining it in one region and shrinking it in the other, like ReLU, or shrinking it everywhere. So usually what that means is that your linear layers need to expand the signal if you want it not to vanish, because your activation is shrinking it. But the way that happens is data dependent and so on, so it's complicated to guarantee. That's why normalization is so much more practical: you don't care why the signal is getting smaller; if you notice it's too small, you make it big again, and you keep going. As I said,
the catch is that sometimes when you do that, you're messing with the optimization process. People don't really care much about that, but for example, and I'm going to talk about this later, with batch norm you're computing the gradient of a different function at every mini-batch. You don't have a single function that you're minimizing: if you're being very technical from a mathematical point of view, because of the renormalizations you do, you basically have a different function at every batch, and somehow you're hoping that this process doesn't diverge and actually converges to something. But it's one of those things where, if you look at the paper, no one even asked the question: is this going to affect the convergence of my optimizer from a theory point of view? People just don't care about that kind of detail. Yes, a question about residual connections.
Yes, residual connections help the signal as well, because when you have residual connections, if you look at the Jacobians, they have the form identity plus the Jacobian of the residual path. In that sense you're ensured that the signal will not vanish, because it can always go through the skip: the Jacobian can only expand the signal, it cannot shrink it. You can get the opposite problem, where the gradients explode, because you have identity plus something positive. But yeah, residuals definitely help. The formula that works, the one everyone is using, is residuals plus normalization layers. Transformers, ResNets, everyone has that; actually, nowadays I would probably struggle to find an architecture that doesn't have residuals or doesn't have normalization. All of them have both, by default. What I'm talking about here is the old world that doesn't exist anymore; back then, we didn't have residuals and normalizations.
Oh, these graphs. Sorry, what do they show? This one is basically what happens when they use the standard Gaussian initialization, and they're trying to show in which cases the signal doesn't vanish. The ones marked as orthogonal are the ones that use this orthogonalization of the weights, which lets the signal flow at every depth, and this is the depth of the model: 50 layers, 100, 200, 400, 800, and so forth. Oh no, this axis is the depth; I'm not sure what the 200 here refers to. It's the number of layers, and they show that if you have this orthogonal initialization, basically all the models are able to minimize the loss, so the loss goes down; whereas if you don't have this orthogonalization, then the deeper the model, at some point it just breaks and the loss doesn't go down.
A question: sorry, does this mean that orthogonal initialization tends to do better than Gaussian initialization? Yes, for linear models. If you go to a ReLU model, the orthogonal initialization is not going to help you, because of the non-linearities. This is Andrew Saxe's work, and he is playing with these deep linear models where you can do the math, and he's just trying to make the point that you have to care about the worst-case scenario, and that these schemes that preserve variance, and the Gaussian here also preserves variance in expectation, the same way the standard formula would for ReLU, are not enough on their own. That is the point here. The paper is more than that, it goes over other things as well, but these graphs are there to give you the sense that the schemes we discussed helped a lot, and they are one reason deep learning took off, but they are mainly useful when you have architectures with two or three hidden layers. When you have that, this initialization is enough. When you go from three layers to 50 or 100, it is not going to be enough; your signal will still vanish somewhere along the way. That is the point Andrew is making here. But obviously, the final solution the community ended up with is that the answer is not to fix the initialization, but to add these other things, like skip connections and layer norms. Initialization on its own could potentially fix it, but it's really hard, and it's not how you want to approach the problem.
If there are no other questions, let's jump into the next bit I want to talk about. So, this is how gradient descent works, and you can see that, depending on your loss at any point in time, the trust region where the approximation holds can be wider or narrower, depending on how fast things change. If you pick the wrong learning rate, the wrong step size, you either end up taking a very long time to reach your minimum, because you're taking very small steps when you could have taken big ones, or, if you take steps that are too big, you jump over the minimum and your system is not going to converge; you get this oscillating behavior. So, a smaller trust region means a higher Lagrangian penalty, a low learning rate, small steps, slow convergence, and a bunch of other issues. A larger trust region means a lower Lagrangian penalty, a higher learning rate, large steps, and instability and lack of convergence.
So the question is: can we do anything to figure out what a good step is, how do we get a good step size? Yes, a question: can we dynamically compute the size of the trust region and then make the learning rate dynamic with it? Yeah, this section is going in that direction. There are maybe different ways of doing this, but one way, and this is somewhat informal math, is to ask the question: how much is my gradient changing? If my gradient is not changing very fast, that means I can take a large step, because things are stable. If my gradient is changing very fast, that means I need to take small steps, so that I can recompute the new direction I should be moving in. That is the high-level idea. Then, ignoring the one-over for a moment, if you just look at how fast the gradient changes, so you take the limit of the difference between gradients as you add a small epsilon to theta, what you get back is the second derivative; this is the definition of the second derivative. And if you then take the inverse of this, because it tells you how much you should move, you should move a lot if the gradient doesn't change, so move a lot if this number is small and move less if this number is big, you take one over it, and what you get back is the inverse of the Hessian.
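Written out (my notation for what I take the slide to be saying), the one-dimensional version of the argument is:

```latex
\lim_{\epsilon \to 0} \frac{\nabla L(\theta + \epsilon) - \nabla L(\theta)}{\epsilon} = \nabla^2 L(\theta),
\qquad
\theta_{t+1} = \theta_t - \big(\nabla^2 L(\theta_t)\big)^{-1} \nabla L(\theta_t).
```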
That is a second-order method: you compute the curvature, the curvature is a measure of how fast the gradient is changing, and you use it to rescale your step. If the gradient is changing fast, you move slowly along the gradient direction. Where this helps in particular: what I described is the one-dimensional picture, and in multiple dimensions you're basically scaling each direction independently. The typical picture is something like this, where you have a valley: in one direction you have high curvature, in the other you have low curvature. Gradient descent will end up jumping from one side of the valley to the other, because the step is too large, whereas if you use the Hessian, it scales the step down a lot in the high-curvature direction, so you end up with a direction that just goes down the valley without jumping from side to side. This is the nice textbook picture of what this should do. Okay, a question: I'm a bit confused.
Is this what we get when using SGD with momentum, this scaling in different dimensions? Yeah, it will turn out that momentum is a way of approximating curvature as well, without computing second-order derivatives. Yeah, exactly. And then Adam, or RMSprop, is yet another way of approximating second-order information, in an axis-aligned way. It basically makes the assumption that the Hessian is diagonal, looks only at the diagonal elements, and approximates the second derivative by the square of the gradient, which, as we will see, under certain assumptions is actually not a bad idea. This particular update, the full Newton step, is not used in practice, because it's very expensive: computing the Hessian is very expensive. It's more of a motivational thing. There are methods that try to use curvature information that is not diagonal, so you might have heard of things like K-FAC or Muon or any of these.
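A rough sketch of the diagonal-preconditioning idea behind RMSprop and Adam (simplified, no momentum or bias correction; the function name and defaults are mine):

```python
import numpy as np

def rmsprop_step(theta, grad, v, lr=1e-3, beta=0.999, eps=1e-8):
    """One diagonally preconditioned step: each coordinate of the gradient is
    divided by a running estimate of its own scale, a cheap stand-in for the
    diagonal of the curvature."""
    v = beta * v + (1 - beta) * grad**2            # running average of squared gradients
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v
```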
So there is this whole family of new optimizers coming out that everyone is excited about, but they're not used, not at scale, because no matter what you do, computing this term is quite expensive, and it turns out you're better off taking a lot more small steps that are cheap than one big step that is expensive. There are some corner cases where taking big, expensive steps is better, but by default you're better off doing the alternative. That's one thing. The other thing is that the engineering of doing this is pretty messy as well. Adam is like one line of code in PyTorch; there are libraries trying to make Muon and K-FAC just as easy, but it's not really the same: you end up needing to understand the algorithm a bit more, and you have to put in a lot more work to get there. So, a question: here, the Hessian does not require the valley
to be axis-aligned, right? Exactly: if you rotated this valley so that it is not axis-aligned, the Hessian would still do the right thing, because what it does is basically rotate things into the eigenvector basis, and then for each eigenvector it knows how fast things are changing in that direction and corrects for that. So the true method that uses the full Hessian doesn't care about the orientation. If I were using something like Adam, it would only be able to deal with the axis-aligned case: this axis is one parameter and this axis is the other parameter, and one parameter has high curvature while the other has low curvature. Adam corrects the learning rate per parameter, so it can give you a small learning rate for this parameter and a larger one for that parameter. But if the curvature is not aligned with the parameters, if you imagine rotating your parameterization so that the valley lies across parameters, then Adam will not be able to deal with it.
Yes, what about methods like L-BFGS? Oh, sorry, let's go back. Those are ways of approximating the Newton step. There is the exact formula, which says I want to do something like the inverse of H times the gradient. In practice you usually do the inverse of H plus alpha times the identity, just to make sure everything is well conditioned and computable; there is some regularization of this H matrix that we apply in practice. But you have the formula, which is H inverse times the gradient, and then there is the question of how you implement it, how you compute it. BFGS is just one approximation for that: instead of recomputing the Hessian, it keeps a running estimate and does rank-one updates of it. K-FAC is a different approximation of the same object. There are a bunch of methods there, and what they differ in is how they approximate that Hessian; what they're trying to do is, technically, the same thing.
And they don't really require much tuning when you use them. Generally, for second-order methods, one of the selling points is that they don't require hyperparameter tuning, because in principle you don't have a learning rate. In practice you do end up with a learning rate, so that isn't quite true, but the learning rate is considerably more robust, because the step-size differences are taken care of by the Hessian, which is estimating the right step size. So it is true that these methods need fewer hyperparameters, and they converge faster as well, because they take these near-optimal steps at every time step. When it comes to applying Newton's method to neural networks, though, it doesn't quite work; well, it works, but it's not an easy out-of-the-box thing. If you are in the convex optimization world, or even the non-convex optimization world but not working with neural networks, with other kinds of objects, other kinds of functions you want to optimize, then usually the second-order methods are way better.
These structured methods are becoming more popular, not outside the machine learning community, but within it, because they take into account the structure of the neural network. Something like Muon, for instance, knows that you have this layer structure, and it uses a structured, per-layer approximation of the curvature, and that helps. It's better because we know there are some pathologies of the Hessian, and some structure there, that you can exploit to make computations faster and so forth.
For these newer kinds of methods, you'll see that there are many papers, many groups, and many people excited about them, but they don't work as well as advertised. My experience so far with these methods is that they look great if what you care about is the number of steps you take, say a hundred steps instead of a thousand, but if what you care about is the number of hours on the cluster that you need to run your training, the advantage largely goes away.
The other thing that is not discussed as much is how these methods behave in a stochastic setting. The theory is for a non-stochastic setting, but with neural networks you are rarely in a non-stochastic setting: you can never compute a gradient on the entire dataset, you have to use mini-batches, small amounts of data, so you end up computing everything on small amounts of data, which introduces noise. And these methods do not deal well with noise.
A question: are these two not supposed to be equal, this one and this one? Yes; why did I say they're not equal? They are exactly equal, I was just trying to be careful. But about this example of the second-order term: the way I wrote it, it is a scalar, more like a directional derivative. I'm just computing the Hessian along one direction, and you would have to do it for all directions. Maybe in better notation I should actually write the formula for the entire Hessian the way it's usually presented, as a matrix, which makes a bit more sense, because otherwise it's a bit unclear how you pick epsilon and what it means to take the limit as epsilon goes to zero. It would be the limit as the norm of epsilon goes to zero, maybe, or something like that, if epsilon is a vector. So it's a bit awkward, and that's why I didn't spend time working out the exactly correct notation with matrices and so on.
But one question is this: okay, we have a formula. Ignore all the practical aspects of it. What happens if I apply this formula to optimize a function like x cubed? First of all, what is the minimum of x cubed? Okay, yes, think about what the minimum of x cubed even is. Sorry, we'll have to stop there.
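(For reference, a worked version of that teaser, under my reading that the point is about Newton's method on non-convex functions: x cubed has no minimum at all, and the update below just homes in on the flat inflection point.)

```latex
f(x) = x^3,\quad f'(x) = 3x^2,\quad f''(x) = 6x
\quad\Rightarrow\quad
x_{t+1} = x_t - \frac{f'(x_t)}{f''(x_t)} = x_t - \frac{3x_t^2}{6x_t} = \frac{x_t}{2},
```

so the iterates halve towards x = 0, a critical point with zero curvature that is neither a minimum nor a maximum; the formula finds critical points, not minima.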
LECTURE 4:
Following up on the literature, there is this whole business of finding preconditioning matrices, right? You pick a matrix P, and some choices of P make your learning faster, while others don't make things go much faster. In particular, the one we're going to talk about is natural gradient, which, in the end, Adam is very related to. So, natural gradient; I don't even have the citations here. The natural gradient algorithm comes from Amari, and it's a very interesting paper if you're ever bored and want to read something. Amari started working on natural gradients in the 60s; I think the natural gradient paper for neural networks is from the 80s or 90s. It's actually a hard paper to read. Shun'ichi Amari is a Japanese researcher, and until recently, I would say, the Japanese community was somewhat isolated from the rest, the US and European communities, and you can see that in the paper: it was one of the most painful papers I have ever read, because the notation is completely different. He uses Einstein summation, which I wasn't used to; he calls the biases 'a' instead of 'b'; I forget what the weights are called; none of the standard notation holds, and the activation functions are used in a weird way. But it basically introduces this concept, and at a high level the intuition is: we're doing the usual thing, where we take a linearization, a first-order Taylor expansion of our loss, we want to minimize it, and we want to create some trust region, some constraint, so that we don't move too fast, right,
so that the approximation holds. But we want that constraint to be in function space. His point, and this is something that is easy to notice, is that there are many values of theta that give you the same function, so your model is not parameterized in a unique way. For example, one thing you can think of is taking two neurons and swapping them around. Technically, you change nothing: if you swap here and you swap correspondingly above, you get exactly the same function. But if you look at the weight matrices, they have swapped columns, so it's technically a different theta, right? They are two different points in parameter space. So his point is that the mapping between parameters and functions is not necessarily one-to-one, and even if it is one-to-one, sometimes a change in theta creates a big change in your functional behavior and sometimes it doesn't. There is a metric issue there: functions don't change the same way parameters change. So all he wants to do is, instead of saying 'I want to take a step small enough that my theta doesn't change by more than epsilon', to say 'I want to take a step small enough that my function doesn't change by more than epsilon'. That's the big idea. Then the next question is: okay, if I want a distance between functions, how do I define one? That is a very hard question on its own, and the choice Amari made was to say: well,
we know that in neural networks the output can always be interpreted as a probability; it goes back to all of those things about probabilistic interpretations, so we can always think of the output of a neural network as corresponding to some kind of distribution. And we know how to compute distances between distributions, so that is what we're going to do. p(z | theta) is basically the distribution that comes out of your model, and we can use the KL, which is a divergence; it's not a distance, but it's good enough and everyone uses it. So I'm going to use the KL to measure how much my output distribution changes if I change theta. Yes, a question: why is it not a distance, what does that mean in this context? It means that if I take the KL of p(z | theta) against another distribution and then reverse the order of the two terms, I don't get the same answer. It's not symmetric, and that's the most visible thing it's missing to be a proper distance; otherwise it behaves reasonably: it's zero when the two distributions are equal, it's non-negative, and so on (a few other distance properties, like the triangle inequality, don't strictly hold either). Does it measure how different they are? Yeah, it measures a notion of difference,
but it's not symmetric, and that's important: it also means that if I reverse the order here, I'm going to get a different algorithm, because the order matters when you do the expansion. Okay, so this is what he was going for. Let me try to skip ahead a bit.
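In symbols (my notation; the slide may write it differently), the constrained step he is after is roughly:

```latex
\min_{\Delta\theta}\; L(\theta) + \nabla_\theta L(\theta)^\top \Delta\theta
\quad \text{s.t.} \quad
\mathrm{KL}\big(p(z \mid \theta)\,\|\,p(z \mid \theta + \Delta\theta)\big) \le \epsilon.
```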
So now the question is: okay, we've done the first step, we've decided how to define our trust region. But obviously this term is nasty, right? I have this KL term and I don't know what to do with it. So the next step in the process is: I want to replace this constraint with one that is a lot more pragmatic, that I can actually work with. The way he does this is to take a second-order Taylor expansion of the KL and plug that in instead of the KL itself. The reason is that once you take the second-order expansion of the KL, it turns out things simplify a lot. Again, I don't necessarily want to go through the math; I'm just going to say a few words, and you can look it over whenever you want.
If you write out the formula of the KL, here I wrote it as a sum over the values of z of p(z) times log p(z) minus log of the other distribution, because you have p(z) times the log of the ratio, and then you convert this into an expectation because it's easier to work with. Then you start doing the Taylor expansion around theta for this term, because that's where the delta theta enters. What happens, and you'll just have to trust me here, is that the constant term in that expansion disappears, because you get log p(z | theta) minus log p(z | theta), which is zero, so it cancels out. And then the first-order term of the Taylor expansion, the one that's just in terms of the gradients, disappears as well. The reason for that is a bit more technical; let me see if I have it. Sorry, do I have the math? No, I don't have it right here. The reason is that the expectation and the derivative are both linear operations, so you can swap their order: the first-order term is the expectation of the derivative of log p, which is the integral of the derivative of p; you pull the derivative outside the integral, the integral of the density is one, and the derivative of one is zero.
Anyway, you can look this up. But technically, because of the expectation, the first-order term disappears as well, and the only term left is the second-order term, which is delta theta transposed, times the expected second-order derivative of log p(z | theta), times delta theta. And this is kind of nice, because now the second-order derivative of log p is going to be your metric. So if I go back to the constraint, all of these other terms disappear; and I really don't have the slide for it, no. If I go back here, basically I end up replacing this KL, do I have it written anywhere, sorry, I don't, but you replace the KL by something that looks like delta theta transposed, times the second derivative of log p, times delta theta. So the second-order derivative of log p, which is a Hessian, is my new matrix. This will end up being what I precondition with.
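Putting the pieces together (my notation for what I take the slide to be doing), the second-order expansion of the KL gives the Fisher matrix as the metric:

```latex
\mathrm{KL}\big(p_\theta \,\|\, p_{\theta+\Delta\theta}\big)
\;\approx\; \tfrac{1}{2}\,\Delta\theta^\top F(\theta)\,\Delta\theta,
\qquad
F(\theta) = -\,\mathbb{E}_{z \sim p_\theta}\!\big[\nabla^2_\theta \log p(z\mid\theta)\big]
          = \mathbb{E}_{z \sim p_\theta}\!\big[\nabla_\theta \log p \,\nabla_\theta \log p^\top\big].
```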
Now, about the second-order derivative of a loss: am I taking an expectation here, or not? No, I'm not. You can always rewrite it using the structure of the architecture. You can see that, because the loss is the loss function applied to the output of your model, you can rewrite its second derivative as a sum of terms. This is basically just the chain rule: the first derivative of the loss, by the chain rule, is the derivative of the loss with respect to the output, times the derivative of the output with respect to theta. That's just the chain rule, just writing those two terms. Now, when I take the derivative of that again to get the second derivative, I have the product rule, so the derivative of the product of these two terms becomes this term plus this term. This is a scheme that's used often for second-order derivatives, and the reason for using it is that people usually reason as follows: I'm going to drop this second matrix, and the reason I can drop it is that, when I'm close to convergence, the derivative of the loss is going to be close to zero, so this term is going to be close to zero and it doesn't matter.
And I'm going to keep this first matrix. This matrix is usually called the Gauss-Newton approximation of the Hessian, if anyone has heard the term. The nice thing about this term is that it has the form of a Jacobian transposed times something times the Jacobian, and you know that when you have this form, it has to be positive semi-definite.
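In symbols (my notation), the product-rule decomposition being described, with the part that gets kept and the part that gets dropped, is:

```latex
\nabla^2_\theta L
= \underbrace{J_y^\top \big(\nabla^2_y \,\ell\big)\, J_y}_{\text{Gauss-Newton term (kept)}}
\;+\; \underbrace{\textstyle\sum_i \frac{\partial \ell}{\partial y_i}\, \nabla^2_\theta\, y_i}_{\text{dropped near convergence}},
\qquad J_y = \frac{\partial y}{\partial \theta}.
```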
So this is how people usually get rid of negative eigenvalues. And it turns out that when you do natural gradient, okay, I know I put these notes here because I was going to derive all of this on the board, but I got a bit derailed; the whole point is this. If I look at this expected second derivative of log p(z | theta), the term I'm taking the expectation over, then, without working all the way through it, I can say that it can be rewritten as the derivative of log p times the derivative of log p, with a transpose somewhere; I'll put it here, I'm not sure it's in the right place, but basically you can do the expansion and you get these terms. Question: what about the delta theta? So the delta thetas sit on either side of this;
for now, I just took this middle term and wanted to expand it. The reason I'm doing this is because of what I want to show next. I'll update the slides with a link for everyone; actually, if you look up this paper that I'm an author of, the one revisiting natural gradient, you have the math step by step in the appendix, if you really want to go through it, how you get from here to there. For now I just want to give you the intuition. So the big change with natural gradient is that you replace the parameter-space constraint with this functional distance, and at the end of it, when you take the Taylor expansion, you end up with a Hessian again. So one question you could ask is: what have I actually solved? I had a Hessian before; I did not like it because it had negative curvature; I went through this whole exercise to change to this distance and whatnot, and in the end I got another Hessian. So was it good for anything? The tricky bit is that you can expand that Hessian and exploit the fact that what you have is an expectation of it, and it turns out that, if you do that, the whole thing is equivalent to an expectation over outer products of gradients.
The advantage of this form is that it is positive definite by construction, or at least positive semi-definite: it doesn't have to be strictly positive definite, but it is at least positive semi-definite by construction. You can take my word for it, or you can look in any linear algebra book, but any matrix of the form A A-transpose, or A-transpose A, I don't remember which order is the relevant one here, is positive semi-definite by construction; you can prove that. So, to check understanding: and then I solve the problem of the Hessian having large negative eigenvalues exactly by using this? Yes, because this matrix does not have negative eigenvalues.
So even though in the end I end up with a Hessian, this Hessian does not have the negative-eigenvalue problem. This plot here is trying to illustrate what's going on. The red surface, and I should have made it bigger, is the loss function that you are navigating. What I'm trying to show is that once I've plugged in that KL and worked it out, I have a different surface, and that surface is always quadratic; it doesn't have a saddle. What I'm doing is taking a descent step on the red surface using the curvature of the green one. That is basically what we're doing with natural gradient: we're creating this additional function, which is the KL, taking the curvature of the KL, and applying it to the function we care about. And this makes sense because what we're really doing is solving this constrained optimization: we're solving our original problem, the first-order approximation of the problem, while constraining how much the distribution changes in a KL sense, and by doing that we end up with this algorithm, which is a preconditioned gradient descent, where the preconditioner ends up being the expected outer product of gradients.
So then it doesn't have any negative curvature. That is the big thing about natural gradient. I'm not going to work it out on the board, because I feel like we've seen too much math already and I'm worried it would just look like symbols, but the additional point is this. You have this particular derivation of natural gradient, which says: if I constrain the KL and do a bunch of math, I end up with this preconditioner that is just the outer product of gradients. And then you have this other derivation, which is not natural gradient; this is traditional second-order methods, where people use this Gauss-Newton approximation. It is a different derivation which says: the Hessian of the loss I care about is equal to the outer product of the gradients from the output to the parameters, times the Hessian of the loss with respect to the output, plus the gradient of the loss times the Hessian of the outputs. This is a big thing in the optimization world; it's called the Gauss-Newton decomposition. The reason I wanted to bring this up,
well, there are many reasons I have this slide, but one reason is that it's kind of interesting, and it adds a bit of history to make things more fun: how these ideas evolved. So, Amari did natural gradient early on, but then people weren't actually using it; for a long time it was kind of ignored. Then there is a paper from Yoshua's group that claims to do natural gradient but is actually not doing natural gradient: in that paper they messed up the preconditioner and used a different matrix. In parallel, you have folks like James Martens who were using this Gauss-Newton approximation and claiming that Gauss-Newton works better than natural gradient because it is connected to the Hessian and so on. And later on there is a paper that actually shows that Gauss-Newton and natural gradient are the same thing, because it turns out that this matrix is exactly the second-order derivative of the KL.
And this is always true if you have a matching loss and output activation function, which you typically do even if you're not aware of it: if you have negative log-likelihood with a softmax output, or mean squared error with a linear output, then this Gauss-Newton approximation is exactly the same as Amari's Fisher. The reason this is important is that, besides the motivation we had here, it tells you why this makes sense: it makes sense because this turns out to be an approximation of the Hessian of your true loss. It turns out that this KL term, whose curvature is exactly this matrix, is related to the Hessian of your loss, differing only by this other term, which is typically treated as zero.
I think I lost half of you, but the whole story was really this: natural gradient is usually presented as a preconditioner, because it is not about using the Hessian of your loss but about using a different matrix, yet it turns out that this matrix has a close relationship to the Hessian. So the whole point I'm trying to get across is that this outer product of gradients is, for various reasons, a very good proxy for your curvature. And this is interesting because, if you squint at it, it is very close to the formula Adam uses. Adam takes the gradient squared, and if you assume that this matrix is diagonal, which it is not, then it would just be the gradient squared; Adam's term is a proxy for that. So this is telling you that you can approximate the curvature of a function by looking at the square of the gradients. Question: do you mean, in other words, that Adam is a special case of this? That's what I'm trying to convince you of: they are connected, though there are some glitches along the way. But yes, you can think of Adam that way, and there is a particular paper, if you want a much more formal treatment, the Bayesian learning rule from Emtiyaz Khan and others, that really tries to pin this down and argue that Adam is essentially a diagonal version of natural gradient.
That is the whole spiel of that paper. It's a bit more complicated, though, for a few reasons; let me go through why. The first reason it is more complicated is that in the Gauss-Newton approximation and in natural gradient, when you compute this matrix, the gradients you use go from the output y to your theta, not from the loss to theta; from y to theta. In Adam, you're using the gradients you already have from doing your optimization, the gradients that start at the loss and go to theta; you don't use the gradients that start at y and go to theta. In the literature this variant is called the empirical Fisher: you have the Fisher, which is what Amari introduced, and the empirical Fisher, which starts from the loss instead of starting from y.
And the reason why the literature around natural gradient is a bit messy, if you ever find this topic interesting and decide to look at the papers, is that people confuse these matrices all the time. You have papers that mix up the empirical Fisher with the proper Fisher and substitute one for the other. They are not the same object; they are different mathematical objects, but people use them interchangeably. So a first step to go from natural gradient to Adam is to replace the true Fisher by the empirical Fisher. The other thing that makes Adam a bit different is that you take the square root. And the square root is a bit of a 'no one really knows why, but it works really well' kind of thing. There are different ways to argue for it. One way is this: if you take a matrix and approximate it by its diagonal, so you remove all the off-diagonal elements and just look at the diagonal, then, if you compare the eigenvalues of the diagonal with the eigenvalues of the full matrix, you are usually overestimating them, and taking the square root pushes them back down, correcting that overestimation. It's a very hand-wavy argument, but it is something people use.
Another argument that maybe some of you will like: if you look at the formula for Adam and you assign units to the different quantities in there, it turns out that without the square root the units don't work out. I don't know if you know this, but in physics people do this all the time: if you want to figure out whether you have the right formula for something, you plug in the units and check that they come out the right way; if the units don't work out, you've messed up the formula, it's not the right one. It turns out that if you assign units to these quantities, you need the square root to make the units work. That is another argument I've seen for the square root, but really the honest answer is that the square root just helps quite a bit. Okay, let me give you another, better reason why you need it. Another reason is that you don't use the true Fisher, you use the empirical Fisher.
So, what is the difference between these things, so the thing the difference between these things is when you are convergence. When the loss is zero, these times becomes zero, and the true fissure, which is this. This function becomes your hash shape. It's the same as the Hashion, so that's fine.
But if you look at the empirical fissure, the empirical fissure is the outer product of the gradients of the loss with respect to Theta, but if you had convergence the gradient of the loss with respect to Theta is zero. So if you square this thing, you get zero. So, if you look at that formula where you have Grady and over radian squared.
As radians go to zero, that thing goes to zero even faster. Because you're defining high gradient zero. So, what that means is that Adam becomes unstable when you're close to convergence if you don't have the square root. Because if you're looking at that limit, as the gradient goes to zero.
that ratio explodes, goes to infinity, because you have g over g squared. And the square root solves this. Well, it makes it better, which seems to be good enough for convergence. The point is the rate: g over g squared behaves like one over g, which blows up as g goes to zero, whereas g over the square root of g squared is plus or minus one, more or less, up to some variations; it stays constant. So this makes things a lot more stable.
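In symbols, for a scalar gradient $g$ (ignoring the epsilon that damps both versions in practice):

$$
\frac{g}{g^{2}} = \frac{1}{g} \;\longrightarrow\; \pm\infty \quad \text{as } g \to 0,
\qquad\text{whereas}\qquad
\frac{g}{\sqrt{g^{2}}} = \operatorname{sign}(g) = \pm 1.
$$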
So those are the kinds of arguments I've seen. Okay, I'm going to push through this a little; I appreciate this part of the lecture is a bit heavy, but optimization ends up being that way. I wanted to make another note that's useful:
we did this whole exercise because we were worried about negative eigenvalues, but it turns out you still need to regularize the Fisher, this matrix we've been computing, these outer products we've been playing with. The reason you still need to regularize it with the identity is that it is positive semi-definite: you don't have negative eigenvalues, but you do have eigenvalues equal to zero, and that is still a problem, because it means you can't compute the inverse. So in practice you still regularize this matrix to get rid of the zero eigenvalues; but at least you know the only thing you need to correct for is zeros.
So you can add a small epsilon just to make it invertible. Can it be anything small, anything positive? Well, you need to be careful, because there is the condition number, right, which is the largest singular value divided by the smallest singular value.
If the condition number is very large, that is, if you look at the spectrum of eigenvalues and the gap between the largest and the smallest is huge, then numerically things are not very stable when you try to do the inversion directly. So you don't just want to add 10 to the minus 12 and say, okay, I don't have any zeros, my smallest eigenvalue is 10 to the minus 12, because then numerically your inversion might still be unstable. You want to add something big enough that the gap between the largest and the smallest eigenvalue becomes reasonable, so that the inversion process is stable.
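Here is a minimal numpy sketch of this point; the matrix and the two epsilon values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(200, 10))
F = (G.T @ G) / 200        # an outer-product-style PSD matrix, like the empirical Fisher
F[:, -1] = 0.0
F[-1, :] = 0.0             # force a zero eigenvalue: F is now singular

for eps in (1e-12, 1e-4):
    w = np.linalg.eigvalsh(F + eps * np.eye(10))   # eigenvalues, ascending
    print(f"eps={eps:g}  condition number = {w[-1] / w[0]:.3e}")
```

With `eps = 1e-12` the matrix is invertible on paper, but the condition number is astronomical; the larger epsilon trades a little bias for a numerically stable inverse.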
But this is more in the numerical-stability space; it's no longer about dealing with negative curvature, it's about making the computation stable when you do the inversion. Um. The other thing I wanted to mention, when it comes to practical second-order methods: one thing they do is use a block structure. What does that mean? You compute the Hessian per layer, so you don't look at how one layer affects a different layer; you compute the Hessian of each layer independently. In terms of the whole Hessian, it means you assume all the cross-layer blocks are zero, and you only keep non-zero elements in the blocks on the diagonal. This is just a trick to save memory and compute, and the justification is that people have argued these off-diagonal blocks tend to be very small in norm.
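A toy sketch of the block idea, using a per-layer empirical-Fisher block as a stand-in for the per-layer Hessian. All names are hypothetical, and this toy version materializes a dense block per layer, which is only feasible for small layers; real methods such as K-FAC factorize the blocks further:

```python
import torch

def per_layer_curvature_blocks(model, loss):
    """Keep only the diagonal (per-layer) blocks of the curvature matrix;
    cross-layer blocks are assumed zero and never materialized."""
    blocks = {}
    for name, p in model.named_parameters():
        g, = torch.autograd.grad(loss, p, retain_graph=True)
        g = g.flatten()
        blocks[name] = torch.outer(g, g)   # one empirical-Fisher block per layer
    return blocks
```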
And I'm going to skip that slide because it's too much math. Um. So, at a high level, this is what natural gradient is about, and this is what second-order methods are about: computing these Hessians and dealing with negative curvature.
But then, when it comes to training models in standard scenarios, it turns out that keeping things cheap is crucial to making them work. Maybe to repeat something I said during the break: all of these methods can be very useful in domains where the loss surface is badly behaved. The example I gave was simulating the Navier-Stokes equations or something like that, where you have a neural network trying to predict the evolution of the system.
Mathematically, if you look at the loss of trying to predict this, it's going to be something that looks very ugly, so that's a place where you want to use these powerful optimizers. But if what you want to do is learn to classify ImageNet, or do language modeling, which maybe many of you are thinking about, then it turns out this is not as important: the loss actually behaves quite well, and usually the outcome is that keeping the updates cheap is the best way to go about it.
This is what Adam and momentum do. You can think of them, to some degree, as a very crude approximation of all this business about curvature, but they're extremely cheap, and because of that they dominate: in almost all standard scenarios they work well. There is another hypothesis in the air that people have been talking about for a while, which is:
especially for modern architectures, we've been developing them to work with Adam. So there is this, I don't know what to call it, this theory: with Transformers and other architectures like ResNets, we have basically tuned the architecture to the optimizer, and if we had been using a much more powerful optimizer, we might have found much more interesting architectures. For us these things are intertwined, right? Whenever I propose a new architecture, I run it with Adam, and if it doesn't work well with Adam, I assume the new architecture is not a good idea and I don't even publish the paper. That by itself is maybe what drove things to this place where everyone uses Adam. It's a bias in the community: in general, researchers prefer to focus on the architecture, not the optimizer, so they pick the default optimizer and run everything with it. Therefore, when they develop architectures, they end up finding the architectures whose curvature is well behaved, the ones for which Adam actually works.
So, all the things from before, the heavy math: it looks scary, and it's interesting, but in the end, in most cases you're probably never going to use it. You'll just use the simple thing, and the simple thing is something like RMSProp or Adam.
And here, maybe momentum should have come first, but here you basically have two terms, right? The first one is the momentum term, where, instead of using your gradient, you use a moving average of your gradient. That does multiple things.
One thing it does is reduce noise. It also shrinks your gradients in directions of high curvature, where the gradient changes direction all the time. And since the average is less noisy, it becomes a better approximation of the true gradient: instead of overfitting to the current mini-batch, you have a moving average over all previous mini-batches, so it's harder to be overly biased by the current batch. The other thing you're doing is computing a moving average of the squared gradients.
One way of thinking about the squared gradients is as a measure of curvature, and this comes from all this empirical Fisher and natural gradient business, where you can show that you can always write your Hessian as a sum of outer products of gradients plus another term, and that extra term usually disappears. By the way, I did not say this explicitly, but when you look at this formula, there is a Hessian in the middle: the Hessian of the loss with respect to the outputs. For most standard losses it is a constant, which is why I ignore it. So this formula, the Gauss-Newton, is really just the outer product of gradients; the inner Hessian is there mathematically, but if you write it down for the usual losses, it turns out to be a constant or something that doesn't matter. So that's the connection: the squared gradient acts as a proxy for your curvature. And then you take a step that is basically
the momentum divided by the square root of this, and that's what you use in practice. And you have this epsilon, which is the regularization that deals with zeros. We were just talking about how the Fisher has zero eigenvalues because it's positive semi-definite; likewise, this squared-gradient term can contain very small numbers, and you don't want to divide by zero. So the epsilon is a regularizer for that; it plays exactly the same role. Usually that epsilon is super small. I know you did a course on RL, so one hint: in RL it turns out to be very useful to make that epsilon much larger.
The reasons are not that well understood, but one difference between supervised learning and RL is that in RL, if you run Adam, it's actually quite useful to treat epsilon as a hyperparameter and tune it to be larger than normal. But otherwise, this is what Adam is.
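Putting the pieces together, here is a minimal sketch of the Adam update, following the published algorithm of Kingma and Ba, including the bias-correction terms, which I'm adding even though they weren't discussed above:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the step count starting at 1."""
    m = beta1 * m + (1 - beta1) * grad       # moving average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # moving average of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction for the warm-up phase
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # note the square root and eps
    return theta, m, v
```

For the RL hint above, the change would just be passing something like `eps=1e-4` instead of the default.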
This is the momentum part; this is the top part of Adam. And really, the intuition for momentum and how it connects to curvature is in this picture, right? If you have a direction of high curvature and a direction of low curvature, you get this zigzagging behavior. Now, if you compute the momentum, you'll notice that in the direction where successive gradients agree, you accelerate, and in the direction where they disagree, the moving average squishes the magnitude. So you end up with the blue curve, which is much better aligned with the curvature. That's, intuitively, how momentum works.
There are different ways of deriving it. One is to take inspiration from physics, which is why it's called momentum: if you have a ball and let it roll on a surface, you get a similar kind of effect. Um, one last bit that I've alluded to, and after this we'll take a short break and then go into the generalization part. So far we've been talking about gradient descent, which means we compute the gradient on the entire data set, for the entire loss.
In practice, that is not possible. So what we do in practice is estimate the gradient using a mini-batch. For this to work, we basically need to assume that our data points are IID: distributed according to the same distribution and sampled independently. Under that assumption, the expectation is well approximated by picking a few samples and averaging over them. And this is the difference people state: the mini-batch gradient moves in the right direction on average, but with a bit more noise.
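Written out, the mini-batch gradient is just a Monte Carlo estimate of the full expected gradient, valid under the IID assumption:

$$
\nabla_\theta\, \mathbb{E}_{(x,y)\sim \pi}\!\left[\mathcal{L}(x, y; \theta)\right]
\;\approx\; \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta\, \mathcal{L}(x_i, y_i; \theta),
\qquad (x_i, y_i) \overset{\text{iid}}{\sim} \pi .
$$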
SGD was mostly introduced as a scheme to make things scalable. In the early days, everyone thought gradient descent was the correct thing to do but we couldn't afford it, so stochastic gradient was a proxy for it. In more recent work, and we'll talk about this after the break, it turns out that stochastic gradient actually works better than gradient descent. So it's not about computational efficiency anymore; it's simply the better thing to do, you get better results from it. Um, and in terms of naming conventions: stochastic gradient descent means your batch size equals one,
gradient descent means you use the entire data set, and mini-batch SGD means you fix a mini-batch size, so you split your data set into groups of whatever, 10, 20, 256 or whatnot. Um. And then, the other very important thing: even with all of these measures of curvature and whatnot,
it still turns out that the best thing to do is to adapt the learning rate. The learning rate is not a constant: because of the approximations we make and all of that, these methods will not give you exactly the right step, so with a fixed learning rate you won't get the best results. By far one of the most important things is to have learning rate decay, a learning rate schedule. This one here is a very old-school schedule; this is how you used to do it: you start with a constant learning rate, and either you have magic numbers for when to divide the learning rate by something, or you watch the training or validation loss, and when it stops decreasing, you divide the learning rate by a number.
That's how it used to be done; I used to call these waterfall learning rate schemes. Nowadays you have a linear warm-up and then a decay of the learning rate, often exponential. That's the standard, and usually what you hypertune is the peak learning rate you start from.
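A minimal sketch of such a schedule; the constants and the choice of exponential decay are illustrative (cosine decay is an equally common variant):

```python
def lr_schedule(step, warmup_steps=1_000, total_steps=100_000,
                peak_lr=3e-4, final_lr=3e-5):
    """Linear warm-up to peak_lr, then exponential decay down to final_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return peak_lr * (final_lr / peak_lr) ** progress
```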
And these schemes tend to work very well. This is just the standard; I don't have a lot of slides about why and how, but if anyone is interested, I'm happy to talk about it. I'm going to stop here for a break, midway, for about 10 minutes, and then we'll continue for the last hour and talk a little about generalization.
But first, this: for a long time, this was the main perspective on learning in neural networks. They're non-convex, they have these very complicated loss surfaces, therefore it's impossible to train them to do anything useful, because whatever you do, you're going to get stuck in a bad local minimum.
And it got to the point, as I was telling you, that at the beginning of the 2000s there were maybe three or four groups still working on neural networks, and everyone else was doing SVMs and other things. With SVMs you have convexity; you have guarantees that things will work out.
So really, what was going on is that everyone was worried about these issues, and then you had a bunch of people who were like, well, maybe it works; it's not great, not terrible, let's push on it. The miracle of what happened is that these people were playing around with things, and obviously they also got some things right to make it work, but then they started seeing consistent behavior. There's actually a paper, not cited here, from Dumitru Erhan and Yoshua Bengio, I believe, that I can dig up if anyone wants to see it, where they look at this systematically, right?
They sample many random seeds, many starting points, and show that the behavior of the neural network is consistent. All of this kind of work mattered because, back in the day, this was a big theme; it was a big reason people did not like neural networks, because they believed this:
that it's all about luck, and there's nothing systematic going on. So this is where the problem of generalization starts, right? You have this loss; sure, you can optimize it and get somewhere better than where you started, but is it anywhere meaningful, and how do you consistently get somewhere that is really good?
The view nowadays is the opposite: if you have a neural network and you have some data, you throw the data at the neural network with Adam and things will work out, and you don't need to worry about anything. And if it doesn't work, just make the neural network bigger, and it will work.
And that's it. Another way of framing it is that people now view the optimization problem in neural networks as being almost convex. Maybe it's technically not convex, but it behaves as if it were. And this is on the back of a bunch of works; these are plots from different papers.
I'm going to try to explain. Historically, what happened is that after some changes, including the switch to ReLUs, changes to initialization, changes to SGD that happened in the background, and the layer-wise pre-training that Geoff and Yoshua were doing, neural networks started to work consistently. There were more and more results, like the ImageNet results and so forth, that people could reproduce and get good numbers with. And then you started getting a series of papers trying to check what's going on; these are generally empirical papers.
For example, this paper here, from Goodfellow, Vinyals, and Saxe. What they do is take the starting point, theta zero, and the point where you end up at convergence, and they interpolate on the line between them, a line in parameter space. They interpolate linearly and compute the loss at every point on that line, and they show that the loss is monotonically decreasing. The point is: if I walk on a straight line from where I started to where I converged, there is no wall, there is no weird shape in my loss surface.
It's just something that decreases monotonically, like in a convex case. They did this on a bunch of examples and showed the same thing, and from there you get this hypothesis that things look almost convex.
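A sketch of that experiment in PyTorch; `theta_0` and `theta_star` are assumed to be saved copies of the parameter tensors at initialization and at convergence:

```python
import torch

@torch.no_grad()
def interpolation_losses(model, theta_0, theta_star, loss_fn, data, n_points=25):
    """Loss along the straight line between initial and converged parameters."""
    x, y = data
    losses = []
    for alpha in torch.linspace(0.0, 1.0, n_points):
        for p, p0, p1 in zip(model.parameters(), theta_0, theta_star):
            p.copy_((1 - alpha) * p0 + alpha * p1)   # theta(alpha) on the line
        losses.append(loss_fn(model(x), y).item())
    return losses   # monotonically decreasing in the paper's experiments
```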
If you want the more math-heavy side, and this is the much more math-heavy side, there's actually a pretty cool paper that I really like, by Yann Dauphin and colleagues; I'm an author, so I'm biased. There's another paper, by Choromanska and LeCun, that says essentially the same thing, and people usually cite that one over my paper. It's a different take, though. It comes from
statistical physics. They build a theory based on random Gaussian fields. What are those? I won't claim to know them deeply, but there is an object called a random Gaussian field that has been studied in statistical physics, and these have a very interesting property: as you grow the dimensionality of the space in which the random Gaussian field lives,
the achievable error and the index become strongly correlated. What does that mean? This sounds a bit heavy. The index is the number of negative eigenvalues, so it means the lower the error, the lower the index: the lower the error, the fewer negative eigenvalues you have.
So basically, if this correlation is very strong, as shown in this picture, as the dimension of the space grows, then all the critical points that have zero negative eigenvalues also have very low error. Let me rephrase this, because this is the theme of where we are now in thinking about the problem.
What that means is that as you blow up the size of the model, all your local minima will basically have the same error, so they're all going to be very similar to your global minimum. What you get instead is an exponential number of saddles, and that's why we talked about saddles before.
So the intuition right now is that the only thing you need to worry about is saddle points; you don't need to worry about minima. Yeah, I don't think I get the intuition: why do all my local minima end up with the same error, why does the gap between them and the global minimum become small, why do they all go to the bottom? Yeah. Uh, so.
Yeah, so, okay. The way this paper works is really by taking these results from physics, where there are proofs for that particular kind of object, and then trying to connect random Gaussian fields to spin-glass models, and spin-glass models to neural networks; that way they can say these should behave the same.
And then they empirically compute this kind of plot, which shows the same strong correlation. The intuition for why this should happen is a bit hard to state. The usual framing, and this is something Yann LeCun liked to say, is that in a high-dimensional space, the probability that all directions point up is very low, so as you increase the dimension, you can always find a way to escape anything that is bad.
That's how he frames it. But really, I think what is happening is something weird with distances, where things become closer to each other, or more equally distant from each other, and somehow that helps you navigate the space a bit better. It is not clear to me what is going on either, and that's why I call this a myth: we don't necessarily have any solid grounding for why this has to happen. All I can say is that it has basically been observed a lot, and people have started connecting the dots and saying: really, it seems to hold in the standard training regime. Here is another related observation.
If you stop partway through training and compute your Hessian, and you check whether it has negative eigenvalues, you'll find that it always does. And if it has negative eigenvalues, it means there is a way to escape, to go lower, because it means that, at best, you're at a saddle.
These are just things people have observed in practice, and that is the background we're in right now. It doesn't mean the loss surface can't be badly behaved, though, and here are a few examples of how things can go wrong; let me try to explain them.
So, in this first paper, which is more of a joke paper but pretty funny, what they're saying is: give me an image, say this Christmas tree image, and I can find a place in the loss surface where the loss looks exactly like that image; there is a subspace of the loss surface that has that shape. The point of the paper is that somewhere, in some subspace, the loss surface is arbitrarily ugly. The only catch is that it is arbitrarily ugly far away from zero, far from where standard training operates. So there are some caveats here. Maybe let me put it this way:
deep learning, the way it exists, exists because everyone who runs these experiments follows the same pipeline. There is a protocol for how you train these models, right? There is a scheme to initialize them, there is a way to do the gradient descent, and so forth.
So the point here is: if you go far away from the standard protocol, you get into trouble. One way to get into trouble is this kind of funky theoretical construction saying, look, you can find any shape you want in the surface. Or, maybe even more trivially:
you can define your training set in such a way that all your ReLUs die. If this is a ReLU model, and this is what that paper does, then by playing with how you pick your data points you can find a data set such that, for one layer, all the ReLU outputs are zero, and then no learning happens, and you're stuck in a bad local minimum because of your random initialization. And there's a related one here,
which is that there is a way of initializing the model to make it have zero training error but be at chance on your validation set. So you can completely break the ability of the model to generalize just by playing with the initialization. But the core idea, the intuition I'm trying to convey, is that empirically things look almost convex and well behaved, and there are theoretical reasons, a bit hand-wavy in the sense that there are some very strong mathematical results in statistical physics and a somewhat informal mapping onto neural networks that people have been employing. But all of these things look good as long as you do things properly. And by properly, I mean you have proper initialization, you use a standard optimizer, you have normalized your data, and
there are no pathologies in your data. As long as you're following the protocol and doing normal training, things look good; as soon as you move far away, things look bad. Then there's this funky paper, just for people who like this kind of thing. This is from
Mikhail Belkin, who also did double descent. It shows something that, to me, is kind of surprising; in the end it doesn't mean much, but it's surprising. The trick with a ReLU network is that I can multiply one layer by alpha and the other layer by one over alpha and I have the same function. That's just because, if alpha is positive, it doesn't change the sign, so the ReLU doesn't care. Really, I have W2 times one over alpha times alpha times W1, so it collapses back to W2 and W1.
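Concretely, for a two-layer ReLU network, using positive homogeneity of the ReLU ($\mathrm{relu}(\alpha z) = \alpha\,\mathrm{relu}(z)$ for $\alpha > 0$):

$$
f(x) = W_2\,\mathrm{relu}(W_1 x) = \tfrac{1}{\alpha}\, W_2\,\mathrm{relu}(\alpha\, W_1 x)
\qquad \text{for every } \alpha > 0 .
$$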
This means the set of minima is always curved, because this one-over-alpha family traces out a curve. So if I have a global minimum, it is never going to be a point; it's going to be a region, and that region is always curved, because it has to follow this one-over-alpha relation: if I have a point that is a global or local minimum, I can construct another one by multiplying one layer by alpha and the layer below by one over alpha, so I can trace out this curve. What that means, and this is the punch line of the paper, is that if you take a minimum and zoom in, no matter how much you zoom, it is always going to be curved, and it's always going to have some negative curvature.
The standard assumption people make is that things are locally convex, and therefore you can reason with gradient descent and all this stuff; this breaks that. So basically, all of these slides, the details don't matter, are trying to say that the loss surfaces of neural networks are genuinely complex as mathematical objects,
and there are papers highlighting this. But at the same time, as long as we use networks following the standard recipe, they are extremely well behaved, to the point where people don't worry about it. So now let's go into some more reasoning about why things are well behaved, and the usual story of where this generalization power comes from; it's not necessarily connected to those early works, but it's a little more hands-on and makes a bit more sense.
And to do that, I'm going to start from the standard point where people begin when they talk about this. Usually the way you introduce it is with a plot like this: okay, I have some data points and I'm trying to fit something to them. I try to fit a line, it looks something like this. I try to fit a quadratic, it looks something like this. And then I fit a ninth-order polynomial, and it looks something like this. Now, if I'm just looking at my training error and my validation error?
Whatever the training error says, if I'm looking at these plots, which one seems more reasonable for the data we have? Probably this one, right? So this is the classic overfitting versus underfitting story: there is a sense in which just driving the training error down is not good, right?
There is a point where you're no longer doing the right thing. And what we care about in neural networks, and the way we find the right time to stop, so as not to end up in the ninth-order-fit regime, is usually the standard train/test split.
So what I'm trying to say here is that what we really care about, for the moment, is being able to generalize in-domain. And the way we ensure we keep that property is by relying on statistical learning theory, which basically means we assume there is some distribution pi, and we sample data from that distribution in IID fashion. So every time we want to compute this integral, the loss that we really care about, the only thing we need is some unbiased samples with which to estimate the loss, and this is how things go in practice:
you have a bunch of samples you call the training set, the ones we're allowed to train on, and then you hold out some examples, the so-called validation set. When the validation loss starts increasing, it means things are bad. And this underpins basically all of machine learning practice. It's not a deep thing, but it is the standard setup, right? Whenever you're dealing with data, you always have a training set, a validation set, maybe a test set.
You can do other things, but this is the standard recipe: you train on the training set and use the validation set to see where you're going. It's not the only choice, as I said, and this choice has problems of its own; it's becoming particularly problematic
for LLMs. People now have a really hard time with this, because what is really hard to know there is whether your test set is included in your training set. It's a real problem: with the size of the data we have, the way the data is collected, and our limited ability to understand the data, this concept of sampling IID from the distribution kind of breaks down. I think it's actually one of the big problems the community has in this space, and I'm not sure how many people are really thinking about it, but it hurts, to the point where it is becoming harder and harder to know whether a model is actually better than the previous one: you can get better metrics along different dimensions, and somehow the model is still not better, because you're not measuring the right thing. And that comes from this distributional assumption that is hard to maintain. So, the other choice, and I think I had a slide for it, I'm not going to go too deep into it, but there's another choice that I think sounds interesting, though it has problems as well: this choice is given by the minimum description length (MDL) principle,
Solomonoff induction, and other funky things like that. Basically, all it says is that the model that compresses the data best is the model that will generalize; that's the principle. The idea is: you look at how many bits you need to store the model, and how many bits you need to store the data given that model. You look at these two terms, the bits the model takes plus the bits the data takes given the model, and the thing that minimizes this total is the one that's going to generalize best. And the nice thing, if you take the prequential approach,
is that you don't need a distribution. Let me explain how this works, looking just at this second term. You take a single data point and fit the best model you can to it. Then you take two data points and fit the best model you can to those. And you keep going: for every prefix of the data, you fit the best model you can. This generates a curve, which is almost like a training curve, but not quite, because for every subset of the data you fit the best model you can.
And then you compute the area under this curve.
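An idealized sketch of prequential evaluation; every name here (`model_factory`, `fit`, `loss_fn`) is a hypothetical placeholder, and in practice you'd never refit from scratch at each step:

```python
def prequential_code_length(model_factory, dataset, fit, loss_fn):
    """Sum of 'next point' losses, each scored by the best model fit on the prefix.
    (The very first point is skipped; it would be coded under a prior.)"""
    total = 0.0
    for t in range(1, len(dataset)):
        model = fit(model_factory(), dataset[:t])   # best model on the first t points
        x, y = dataset[t]
        total += loss_fn(model(x), y)               # code length of the next point
    return total   # lower total (area under the curve) = better compression
```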
The model with the lowest area under that curve is the one that's going to generalize best. There is another, more intuitive way this has been framed; it's vaguer, but maybe useful. I believe it's in the paper from Jörg and colleagues, and it says: the model that learns fastest is the one that's going to generalize best, because it has the right inductive biases. So the concept is that you don't just look at the loss you reach on the training set at the end; you look at how quickly you get there.
And if you get there faster, it means you somehow have the right structure, and you're exploiting that structure to get there; therefore you're going to generalize better as well. So is that why a CNN would be considered better than a vision Transformer on images, because it learns faster, at least before some scale? Yeah, so, in some sense yes and no; to be fair, this is a bit nuanced. In some sense yes: CNNs do have the right inductive biases when it comes to images, because they assume this translational invariance that Transformers do not, and therefore you expect Transformers to need
a lot more data to learn the true structure. It's almost like, you know, we talked before about strong generalization: convnets basically encode something about images that you know has to be true, namely that locality matters when you process images, and Transformers have to discover this by themselves.
So, in that sense, convnets should generalize out of domain in a more meaningful way when the task stresses this concept of locality, and if Transformers don't learn it exactly right from the data, they will struggle a bit more.
Now, empirically, I know everyone is switching to ViTs and they tend to work better. I'm not really a vision person, so I don't know exactly the reasoning, but I assume it's a mix of everyone wanting to use Transformers because they're very popular and well supported by the libraries, and a question of scaling: certain things just scale better. And there is a known weakness of convnets here as well,
which is: if you have two objects that are related to each other but are not close to each other in the image, the only way to make the connection between them is to go very high in the convolutional hierarchy. When I get to talking about convolutional networks, I'll show you this.
Basically, the depth controls how much of the image you're actually seeing, the receptive field. So if you want to connect two dots that are far apart, like here, you need to go pretty high in the hierarchy. The problem is, as you go higher in the hierarchy, you're throwing away high-frequency content; you're only looking at a smoothed version of the image, and that makes it really hard to reason. With occlusions, for example, convnets have a really hard time, because they struggle to connect parts of an object that are not contiguous, especially when there's something in between them.
Transformers don't have this issue: in a single layer they can jump around, because they don't have this locality constraint. That is useful. You could also argue that locality is not exactly the ground-truth inductive bias either, because of these occlusion cases where it works against you. But in spirit, that's what this prequential stuff is trying to capture. I think you had a question as well?
Not sure, I'm just trying to link the MDL principle, which is about compression efficiency, with the idea of generalization: is the claim that whatever compresses the information better, or is more efficient, generalizes better?
So, the MDL principle is an information-theoretic kind of principle; it looks at the data. It's not really about which model converges faster. The true definition is exactly the one I gave, where, independently, for each prefix of the data,
you train the best model you can on that data. The way this has been done in practice, like this experiment at the bottom, which is from Jörg, who likes this a lot, is to look at the area under the training curve, because you can't afford to train a ResNet a million times, once for every new data point. So when you "retrain" on new data, you just take a single SGD step starting from the previous solution, and that's it. So, in that sense, and I'm not sure if this answers your question, this principle, the way it's used in practice, has been bastardized into "which one trains faster". But really,
the true test is: what is the cutoff of data points you need to get low training error? If you can get away with fewer data points, you have a better model. That's what's going on conceptually, but it has been bastardized into the training-speed version because that's the practical bit.
So let me just check what the next slide is. Sorry, I'm still a bit lost between the two concepts: I understand the second one, the compression one, but how does it connect to faster convergence? So, there is no theoretical link
as such. The original MDL concept, whose first paper, if I'm not mistaken, might be from the 60s, was developed a lot in the 90s and early 2000s; Marcus Hutter is a big name who has worked on this quite a bit. The way it has been translated to day-to-day deep learning is by converting it into this training-speed proxy.
And the reason is that people argued that a cheap proxy for retraining on the extended data set is to take the previous solution and do a single SGD step on the new data. But aren't there information-based metrics already? For example, if I want to evaluate a language model on some text, I can use the perplexity per token or per character, and if the perplexity is very low, I can say the model generalizes well on this kind of language. I mean, the perplexity is the same as the loss; the perplexity is really the likelihood, just
on a different scale, but it's the same likelihood object that we have right here. So what this is telling you is not to just look at perplexity, which would be the usual way of doing things with a train/validation split, but to look at which model needs the least amount of data to get the perplexity down. That's the concept this is going for, as a measure of generalization. And just to connect it to day-to-day practice, which is the reason I thought it useful to present this:
it's a bit of a niche thing, so you're not going to find it in traditional deep learning courses, and people are not going to talk about prequential evaluation, but there is a reason I put it in the slides. It doesn't require a distributional assumption. And, I didn't talk about this experiment, but it seems it can detect spurious correlations; that remains to be seen, it's one experiment and one claim from one paper. What is interesting, though, is that for large
language models, empirically, at places like Google and Facebook and whatnot, people are actually using this thing without knowing it. What happens nowadays, because of the scale and the computational cost, and I can tell you this from my own experience, is that in the pre-training stage they do not have an explicit validation set whose perplexity they track. They select models based, I guess, on the training perplexity. And if you look at the code, what they do in practice, again for computational savings, engineering kinds of tricks, is this: because you're doing backpropagation, you first do the forward pass and get your perplexity, and then you do the backward pass, compute the gradient, and apply the update.
So what they do in practice is first evaluate on the new data and then take the step on that data, which in some sense is very close to what the prequential approach says. And the other thing they do, because that point estimate is noisy, is take a moving average, which roughly corresponds to looking at the area under the curve.
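A sketch of what that looks like in a typical PyTorch-style training loop; `model`, `compute_loss`, `train_loader`, and `optimizer` are placeholders:

```python
ema_loss, decay = None, 0.99
for batch in train_loader:
    loss = compute_loss(model, batch)   # forward pass: scored BEFORE the update,
    loss.backward()                     # so this batch is still unseen at this point
    optimizer.step()
    optimizer.zero_grad()
    # smooth the noisy per-batch number; roughly the area under the prequential curve
    ema_loss = loss.item() if ema_loss is None else decay * ema_loss + (1 - decay) * loss.item()
```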
So this is usually the metric that big groups end up using in practice for these big models that take most of the compute and data they have.
It's not exactly done correctly, but it's close enough in spirit. And this is a claim that Jörg and Marcus Hutter, people I know, they're at DeepMind, I work with them, would make. Obviously they have a stake in this fight, because prequential evaluation is the thing they came up with, but they argue that people are actually using it day-to-day.
To expand on that point: in large language models, there is usually the pre-training stage and then the post-training stage; again, I don't like these names, but anyway. Pre-training is really: you have lots of data, you train the model from scratch, and the only thing you look at is perplexity. Post-training is when you start doing safety training, trying to remove harmful language, doing RLHF, doing instruction tuning to make the system follow instructions, and all of that.
These are all fancy names; what post-training really is, is a bunch of stages of fine-tuning on very dedicated data sets, or with very dedicated objectives. And it's in post-training that they have evaluations other than perplexity, right? They use different ways of evaluating the system: QnA kinds of things, where they look at how well it answers questions, human evaluations, and all kinds of other things. All of that happens in post-training. But pre-training, which is the bit that takes most of the time, like 99.9 percent of the compute and energy goes into pre-training,
is really just perplexity-driven. It's just predicting the next token: you have lots and lots of data, and the model just predicts the next token.
Even validation itself is costly: the compute you need is so big that people don't usually do it the obvious way. The way you would do it is: you run your training job, you ask your training code to save the weights regularly, and you have another process that you spawn on the cluster, which loads up that checkpoint and evaluates it on some held-out set; that way you get some sense of how your validation error evolves.
But what people have started doing recently, recently meaning maybe the last three or four years, is dropping this validation job, because it's too expensive; everything is expensive. What they do instead is look at the training error; that's their claim. They say: it's sufficient for me to look at the training error, and that will tell me when to stop and whether my model is doing well; if everything is fine, I can use the training error instead of the validation error.
Now, this is not quite the usual training error you'd get by just logging your training loss; there are two differences. One is in how they compute it: they exploit the fact that to compute a gradient you first have to run the forward pass to get the loss, because that's how backprop works. So they first evaluate on the data and then take the step. The number is not computed on data you've already trained on; it's on data you're about to train on, since you compute the loss before you train on it. In that sense, you're not biased, right? You first use the data to evaluate, then you use it to train. The second difference is the moving average, and the reason for it is that otherwise things are too noisy; but when you do the moving average, it's almost like integrating the area under the training curve.
So it's not exactly the prequential thing, but it's connected to it. Why isn't everyone using this idea of evaluating on the data point and then doing the gradient step on it? Even when I train a very small CNN, I'm losing, like, 20 percent of my data to validation. Yeah, I mean, okay, so there's no good theoretical grounding for this. If you talk with the people who work in this space, they would say it works because we never have enough compute to overfit the LLM anyway, so it's fine,
and that's why they can do it. On small models, people just look at the validation error; they never really rely on the training error. I mean, you look at the training error for pathological behaviors, but you trust the validation error: it's easy to compute, and most of the scripts you find online automatically compute it, so it's just a practice thing. I think what Jörg and Marcus Hutter would like is for everyone to do prequential evaluation, and for no one to use a validation set.
But the truth is, the MDL principle really says you need to train the best model you can for every data prefix, and that is not what we're doing; there's no strong theory saying the proxy we use is good enough. So, anyway, this is also new stuff. It's one of those things where maybe at some point in the future we'll end up doing this. I kind of doubt it, because I think there is a big gap between the MDL principle and the trick we use in practice, and that gap might break things quite a bit.
It's still interesting to think about it this way. There is also, since we have time, one more thing I should mention, because I find it a bit interesting: this also sounds a little like black magic. If you listen to Marcus Hutter, he makes it sound like this is the answer to all problems: you don't need a distribution, you don't need anything. But there is a catch in MDL as well, and the catch, according to me, is that the area under the curve you're measuring is only meaningful after you've seen sufficient data, and what "sufficient data" means,
no one knows. Say I have a process that, for a million steps, generates data like a linear function, and after a million steps switches to a different function. Until you get to the switching point, the linear model is the best way to model your data, and then, as you evaluate further, you'll see it doesn't work anymore, because at some point the process changed. So really, the claim that the model which compresses the most is closest to the true program, because the most compressed form of the data is the program that generated it, only holds once you've seen enough of the data generated by that process, enough to somehow cover the entire program.
This is not well handled in the MDL literature, right? The MDL literature says: you can do this trick, and if you do it, it's fine. But I think there's a danger there: you can always reach the wrong conclusion if you don't go far enough in how much data you see.
Yes, but at the scale at which this is implemented in practice, that isn't really an issue? Yeah, for the LLM stuff that's not an issue; I'm just saying, as "the right way to do model selection", say for toy things, it's not necessarily appreciated that there is a point before which you can't really trust the system, until you've interacted with it enough. And that's never stated: if you read the papers, he never talks about that; he only says, I don't need a distribution, this works, and it's sufficient to look at the end of the curve. How much time do we have? A quarter of an hour, 15 minutes. Okay, I'll try to push through more slides, because I was hoping to be further along in my deck.
But yeah, okay. This whole prequential learning thing was a side note. If we go back to more traditional learning theory, and this is maybe something you already know, this is the picture you'll find in a textbook: you usually have three regimes when you play with model capacity. One, you're underfitting: you have a small model you're trying to train on the data, and in this scenario the validation loss, the test loss, and the train loss track each other and keep going down. Two, the model is just right: it has the right capacity, and that's when the test and train losses are at their lowest.
And three, as you keep training, or as you make the model bigger, the training loss keeps going down, but the test loss starts going up; this is when you start fitting the noise. It's exactly the polynomial picture from before, right? This is where you leave the good fit: if you keep increasing capacity, you drive the training error even lower, but at the expense of the validation error. And how does this play out for neural networks? The way people think of it, capacity is the size of the model. Classically you'd compare, say, a linear model with a quadratic model, but once you move to the neural network space,
all neural networks are universal approximators, and their expressivity, their capacity, is governed by the size of the network: how many weights you have, how many neurons you have. So the old intuition was that you need to pick the right size of model for your problem,
not too big, not too small. Does this relate to double descent? Yes, this is exactly related to double descent. This is what an old-school introduction to generalization would tell you, but in practice we know it's incorrect: we know that the bigger the model, the better,
and if you keep making it bigger, it's even better. This is where double descent comes in; but oh, I had a different slide before that. Before I talk about double descent, I just want to say that you can control capacity through the model size, but you can also control it through the number of training steps.
The way to understand that intuitively: even if your neural network is infinitely wide, a universal approximator, as expressive as you want, if you limit the number of SGD steps you take, you limit the number of functions you can reach, right? You can't traverse the whole parameter space; you can only travel so far. So that's another way of limiting capacity: limiting the number of steps acts like limiting the number of parameters. You can make the model very big and limit the number of steps, or allow as many steps as you like and limit the size.
Yeah, I keep swapping between these two, and maybe that's a bit confusing. Another way of controlling capacity is through regularization; sorry, I'll get to double descent right after this. Regularization is really this: you have a loss that has multiple (local) minima, and you add a regularizer that has only one minimum. Say it's an L2, so its minimum is at zero. When you sum the two surfaces, the minima of the loss now take different values, because the regularizer favors the minimum closest to zero and assigns higher values to the ones farther away.
That's one way of thinking about how the regularizer picks a solution: by prioritizing the ones closer to zero. I'm going to skip over this slide because it's a lot of math; it just shows, and we talked about this before,
that if you're being properly Bayesian, you get a regularization term for free, namely your prior. If you're being properly Bayesian, when you optimize, your objective is the negative log-likelihood, how well you fit the data, plus how well you respect your prior; and if you pick your prior to be a Gaussian centered at zero, that term ends up being an L2, which basically says: I prefer solutions with small norms. So this is just a probabilistic way of arriving at the regularization term; it's the natural outcome of having a prior.
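Spelled out, maximizing the posterior with a Gaussian prior gives exactly the L2 term:

$$
\arg\max_\theta\, \log p(\theta \mid D)
= \arg\min_\theta \big[ -\log p(D \mid \theta) - \log p(\theta) \big],
\qquad
p(\theta) = \mathcal{N}(0, \sigma^2 I) \;\Rightarrow\; -\log p(\theta) = \frac{\lVert \theta \rVert^2}{2\sigma^2} + \text{const}.
$$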
Yeah, maybe there's not that much more to say here. I'll come back to this when I talk about convnets, but just to note: we have many techniques to regularize, and all of them constrain the problem. Here, with data augmentation, for example, the model has to learn that both flipped versions of the image are a cat, so it has to learn more and uses up more of its capacity. Okay, there are quite a few slides before double descent. Do I have time to go through that many slides, or should I jump to double descent? Let me just try to go through all of them; double descent should come right after this.
So, we're on the regularization point, and there is another observation. The L2 story is the super traditional, textbook-circa-2000 way of regularizing a model; there are funkier ways. This next idea, flat versus sharp minima, was thrown out in 1997, by Hochreiter and Schmidhuber. There is an effect that people found very surprising for a long time: we were doing mini-batch SGD because we couldn't afford anything else, and at some point GPUs caught up
and became extremely powerful, so people said: let's just do full GD, because now we can, and that's supposedly the right thing to do. And when you do full GD, it turns out the results are worse. So there was this big debate about what's going on: why is SGD helpful?
And it turns out the the one of the hypotheses that is kind of sticking around. This turns out that the noise is helpful because it allows you to escape sharp minimum, so the intuition here is you have a loss like this where you have something that's very narrow, and then a flat Minima because of the noise in in.
SGD, you cannot converge in the narrow one; the noise will push you out. So the only minima you can converge to are the ones that are wide enough, much wider than the variance of your noise.
There's also a more MDL-style principle for why this is useful: the argument is that flat minima are much more compressible, you need fewer bits to describe them. But another way to think about it is this. If this is my training loss (I don't have a slide on this), you can imagine that the test loss is basically a noisy version of it, the same surface shifted a little bit. And the thing is, for the test loss:
if you shift a narrow minimum a little bit, your loss is already going to be very high; but because a flat minimum is flat, if you add any noise the loss stays roughly the same. So that's another intuitive way of understanding why flatness helps, and the key ingredient here is the stochasticity of gradient descent.
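To make the escape intuition concrete, here is a toy sketch in Python; the landscape and every constant in it are invented for illustration, nothing from the slides. A sharp well is carved at x = 0 inside a broad basin centred at x = 3; noisy gradient steps get kicked out of the sharp well and spend their time wandering around the flat one.

```python
import numpy as np

# Toy 1D landscape: a sharp, narrow well at x = 0 carved into a broad basin at x = 3.
def loss(x):
    return 0.05 * (x - 3) ** 2 - 0.5 * np.exp(-((x / 0.1) ** 2))

def grad(x):
    return 0.1 * (x - 3) + (x / 0.01) * np.exp(-((x / 0.1) ** 2))

rng = np.random.default_rng(0)
lr, noise_std, steps = 0.01, 10.0, 20_000

x = 0.0                                   # start at the bottom of the sharp well
near_sharp = 0
for _ in range(steps):
    g = grad(x) + rng.normal(scale=noise_std)   # noisy gradient, mimicking SGD
    x -= lr * g
    near_sharp += abs(x) < 0.2

print(f"fraction of steps near the sharp minimum: {near_sharp / steps:.3f}")
print(f"final position: {x:.2f}")   # typically wandering somewhere in the wide basin
```

The per-step kick, lr times noise_std, is about the width of the sharp well, so the iterate cannot sit in it; the broad basin is an order of magnitude wider than the kick, so the iterate survives there.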
Just a quick question, sorry, if you have time. Why do most people not utilize this? They focus on regularization, on the learning-rate schedule or whatever, but we don't see something like a batch-size schedule, for example starting with a very large batch and then changing it to control the noise.
Exactly, yeah: why, in general, don't people mess around with the batch size during training, and instead just select an effective value like 64? Yeah, so I don't necessarily have a good answer. My expectation is that the engineering work you'd have to do for that doesn't buy you enough; it's basically not worth it for the improvements you'd see, because, I mean, it depends on your setup.
If you're playing with CIFAR or whatnot, there's probably no engineering work. But if you're playing with a large data set, you'd have to serve it in different ways throughout training, change how you're sampling, and things like that. There is another reason, maybe correlated to that, which is that when you change the batch size,
you need to change the learning rate. There is a formula going around for what the correction to the learning rate should be when you change the batch size, but that formula is not ground truth. So you can't really stop midway and retune your learning rate just because you changed your batch size.
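For reference, the formulas usually quoted are the linear scaling rule and a square-root variant (the latter often suggested for Adam); as said above, neither is ground truth, so treat this helper as a heuristic sketch:

```python
def scaled_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Heuristic learning-rate correction for a change of batch size."""
    ratio = new_batch / base_batch
    return base_lr * (ratio if rule == "linear" else ratio ** 0.5)

# A recipe tuned at batch size 256 with lr 0.1, scaled up to batch size 1024:
print(scaled_lr(0.1, 256, 1024))           # linear rule -> 0.4
print(scaled_lr(0.1, 256, 1024, "sqrt"))   # sqrt rule   -> 0.2
```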
And then, if you're doing Adam or things like that... actually, I haven't even mentioned this, but in Adam you have at least two extra hyperparameters, beta1 and beta2. These are the moving average of your momentum and the moving average of your squared gradients, and those might depend on the batch size as well, so maybe you'd need to retune them too.
So there are all of these side questions that people don't want to answer, and it's always just easier to have a fixed batch size and tune everything else around it.
There are algorithms built around this idea; one example is SAM, sharpness-aware minimization, which is a take on SGD whose main goal is to find flat minima. It was used quite a bit at some point, and there are variations of Adam and so on that are framed explicitly around this and try to exploit it.
How much time do we have? Five minutes. I'm not going to get to the double descent, so let's leave that for tomorrow; I'm just going to continue with this. I did want to say this part, because it's actually a cool thing, even if I have to rush through everything. This is actually, by far,
one of the more intriguing hypotheses; it shows up in a lot of places, and it's one of the things that helps explain a lot of what is going on. So if there's any take-home message here, it's that understanding the sharp-versus-flat-minima behavior
is kind of important, because it connects a lot of things. There is another framing of this, and I'm not going to go into the math of it, but, if I remember correctly, they did this thing where they look at the updates you're doing and work backwards from them
to find a loss such that, if you followed the gradient flow of that loss, you would get exactly the behavior of your updates. Okay, the details don't matter; the whole point is that they managed to identify an alternative loss
that your normal updates are actually minimizing. And that alternative loss is your original loss plus a regularization term on the norm of the gradients. And this regularization term on the norm of the gradients is basically trying to do the same thing: it says that the minima you find have to be flat. So this keeps coming up in many places.
And the interesting thing about this line of work is that GD has one form of it, while SGD gives you a different form. That's interesting, because now you can write down the implicit regularizer that GD is imposing and the implicit regularizer that SGD is imposing; because the two behave differently, the regularizers are different, and you can understand where the difference is coming from.
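Written out, the implicit-gradient-regularization results have roughly this shape (I believe this is the Barrett and Dherin work and its SGD follow-up; take the exact constant as indicative, with $h$ the learning rate and $L_k$ the loss on minibatch $k$):

$$ \tilde L_{\mathrm{GD}}(\theta) \;=\; L(\theta) + \frac{h}{4}\,\bigl\lVert \nabla L(\theta) \bigr\rVert^2, \qquad \tilde L_{\mathrm{SGD}}(\theta) \;=\; L(\theta) + \frac{h}{4}\cdot\frac{1}{m}\sum_{k=1}^{m}\bigl\lVert \nabla L_k(\theta) \bigr\rVert^2. $$

GD only penalizes the norm of the averaged, full-batch gradient, while SGD penalizes the gradient norm on every minibatch.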
So, the difference, and sorry, I'm skipping over this quickly, but I'm going to stop after this: for GD you penalize the norm of the full gradient over the entire data set, while for SGD you penalize the gradient on each of the minibatches. Now think about the parameter space: if there are directions that cancel out when you average into the full gradient, GD is not going to impose flatness in those directions,
while with SGD you impose flatness more widely, all around you. So that's one story for why SGD works better: it finds minima that are flatter in more directions. And the last part of the slide is that, while flatter is better, there are still open questions. For example, from this hypothesis you would assume that adding any noise is useful, because it helps you go flatter.
It turns out that only certain types of noise are useful; Gaussian noise is not going to help you, the optimization just gets worse. So it's quite frustrating: there is definitely something about noise, but it's specifically the kind of noise you get from stochastically sampling data points.
For example, here they tried to use noise coming from data augmentation, and that hurt: getting rid of that noise actually gives you better performance. Does this relate to what you mentioned about deep learning models performing better with natural data? Yeah, yeah, so here we're basically trying to see...
I mean, this is just one facet of that, but we're trying to see, with data augmentation, where we pick standard augmentations for images, whether that noise is as good as the noise coming from sub-sampling the data at forcing you to generalize better, and the answer is no.
It's not just reaching some minimum; the nature of the noise that comes from stochastically sampling data makes the minimum flatter in the right way, and what exactly that means, we don't know. But it's not the case that anything you add on top of your SGD updates will help; it has to be that kind of noise.
Okay, with this, I think I'll have to stop here, because otherwise I'm going to run over, and I think that's been quite a bit already. Let me see... no, okay, there are quite a few more slides. Yeah, okay, so we'll catch up on that
tomorrow morning, I think. Thank you, thank you.
LECTURE 5 AND 6:
Mostly separate from the model. So, the idea of natural gradient is that this y is usually a distribution; it's assumed to be some distribution. This is the whole probabilistic interpretation of machine learning; I didn't go too much into it, but the point is that in machine learning, for any model you have, you can always interpret what the model is doing as building a distribution over your observations. So when you have, say, a classifier, what the neural network produces is a distribution over the possible labels. For example, if you use mean squared error, you would interpret the output of your model as the mean of a Gaussian, which gives you a probability for each possible target. So typically, whenever you have a neural network, you can always add a probabilistic interpretation on top of its output; you just need to pick the probabilistic interpretation that makes sense for the behavior of your model. Again: with mean squared error, you can assume that what your model outputs is the mean of a Gaussian with some fixed standard deviation, and that is the standard assumption people make. And I know you guys do RL; you have to do this all the time there. When you have a policy, the policy is a distribution over actions, parametrized in different ways: if you have continuous control, you probably use a Gaussian, where the network outputs the mean and you assume some standard deviation; if you have discrete actions, you probably use a softmax, and then you have a multinomial distribution over the possible actions. The same thing holds everywhere else in machine learning: every time you have an output, you usually interpret it as a distribution. Hopefully that's somewhat clear. So the point here is: if you can interpret your output as a distribution, then you can define distances between distributions, and the most common divergence, the one everyone uses, is the KL divergence. Why is that useful? Because now I can phrase my constraint in terms of this KL. And why is that good? Because, as we explained, when you're building a trust region, you really want a sense of how much your function is changing; that is what determines whether the approximation holds or not. And the KL feels a lot closer to that than just looking at the change in the norm of your parameters. So this is the intuition, the plan behind natural gradient. Now what happens is that you have to deal with the KL, and there are different ways to do that. Maybe I can even draw a parallel to PPO: PPO has to deal with exactly the same thing, right? You have two options. One is, you leave the KL as it is, but then you cannot solve the constrained optimization in closed form, so you can't really write down the update; instead you get these nasty approximations where you do a bunch of gradient steps on the KL to figure out how to stay inside the constraint. And, I mean, I don't know if you guys knew this,
but the original paper, the original version of PPO, actually had two variants. One was using the KL, exactly like this, where you do gradient steps on a KL penalty in order to respect the constraint. So you can do those kinds of tricks; in practice it's not done much anymore, but one choice is to keep the KL and then either use this penalty type of formulation or do something else. What we do in natural gradient instead is a Taylor expansion of the KL. And the reason you do that is to get to a point where you have a closed-form solution of the whole thing, because with a closed-form solution, you can easily write down the update rule, right? And when you start expanding the KL around theta, you notice that the zeroth-order term disappears, the first-order term disappears, and the only term you're left with is the second-order one: something like delta transpose, times a matrix, times delta. I always get the transposes wrong, but that doesn't matter; you're left with a second-order term. That is the only one that does not vanish; the others, and this can be an exercise if you want, you can try it by hand, you'll see that all the other terms are exactly zero, so there is no point keeping them around. So you're left with this term, and the nice thing about it is that it looks exactly like the trust regions we used to have, where you have this matrix as a preconditioner. The matrix is the second derivative, with respect to theta, of log p of y given the data, where y is now the distribution, right? My notation is not always the best, but hopefully you see where I'm coming from: we said the output parametrizes a distribution, and I'm taking the second derivative of the log of that distribution. This is the Fisher matrix. And, sorry, just to be perfectly correct: the Fisher is actually an expectation, over x and everything else, of this quantity. The expectation is important; it's not just a point estimate, it's an expectation over all the other variables you have. You'll see this if you write down the formula for the KL and try to do the Taylor expansion: there's an expectation that stays there, it does not disappear, and it's technically over x and y, but I'm overloading the letter y. So, the way people typically write the distribution is the probability of y given x and theta. Sorry, how does it work: the expectation is not over theta? It's over the variables: over y as well, over x and y; that's how it comes out. Again, it's easiest to see if you just write down the form of the KL, plug in the definition, and try to compute the derivatives. The other thing I would suggest, if you ever want to go through the exercise:
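In standard notation, what the expansion leaves you with is (the usual presentation, not necessarily the exact symbols on the slides):

$$ \mathrm{KL}\!\left(p_\theta \,\middle\|\, p_{\theta+\delta}\right) \;\approx\; \tfrac{1}{2}\,\delta^\top F(\theta)\,\delta, \qquad F(\theta) \;=\; \mathbb{E}_{x,\,y \sim p_\theta}\!\left[ -\nabla_\theta^2 \log p_\theta(y \mid x) \right], $$

and the natural gradient update preconditions with the Fisher:

$$ \theta_{t+1} \;=\; \theta_t \;-\; \eta\, F(\theta_t)^{-1}\, \nabla_\theta L(\theta_t). $$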
Um, It’s easier to do it, but if you do not consider X. Because if you have conditional distributions, you avoid even X, it’s just like you have XRC both to carry along and usually you get the background because you get mostly all the, the background, all the different things. So easiest you can you assume you have the distribution key of Y or whatnot. And you’re doing, uh, the second derivative of, uh, the KL of, uh, the model. But it looks something like this, and you can – you can find the formula of the line, in which you want, the television, but he’s really, that’s what’s going on. You have the higher the the second order term of the higher this comes up in some kind of fashion and this session is typically called the T-shirt. And then the last step that I was saying that kind of connects with other things, is that it turns out that if you expand the session, you get the difficult formula that people like, which is, is the other product of radiant. I not gonna write it down, but basically, they simplifies better to being just an expectation of our other product of radiance. Um And the reason that is useful is because, uh, the reason why this was useful in the early days is because it gives you a way of computing the quantity in a cheap way, because you don’t have to do secondary budgets. You just have to compute 1st derivatives and no other products with that. So that simplifies, uh, this was actually simplifies sort of the amount of what you have to spend. If there is out that there is a trick to up your secondary, that makes them cheap as well. So it’s not maybe the main reason. But at least initially, for me, this was the big thing. It’s like, okay, I don’t have computer second can delive with this. I only have to compute 1st the and I’m done. Um, you know this are particular important when you have sort of early day libraries that wasn’t very clear how you confuse secondary features without. You – you maybe have sort of written by hand, the formula to complete a variance, or the computer very derivative, but you did not have derived, so you didn’t want to code up the formula to do the secondary effective. So what it will tell you is that’s enough. Like you have the fair delity when you do all the products of those in the right way and you get sort of appro for this. So this is where natural credence comes from. So it’s with the same trick. But we use this feature as a pre-conditioner. and the reason why this is useful is because this corresponds to have this kind of const, which is exactly this is exactly the same constraint that PPO has. PPO is basically this, uh, Mine are some differences. Some people disagree, but my mental picture of PPO is really just natural, radium, and television shows. Because really all you do is we do your normal objective and you have a tie of constraint on the policy, which is sort of one entrograde that is trying to lose off. So if you guys understood PPO, right, that algorithm, this is spirit the same. And then PPO is connected to Adam, which is kind of fine because you use PPO and other together. you’re doing the same team for yeah. that was that useful? Can you ask this? Is there a way to, like, get the cheap, cheaper, or… Uh… It’s the hesher is not always the right-up. It’s not even about being cheaper. The issue of cash and you have the negative temperature and all of this stuff. They’ll make 2nd order method or like they like to go back to the standard and so forth, that makes them not that useful. 
So this whole idea of a preconditioner is basically saying that you can find something better. In practice the Fisher, for example, works better than the Hessian, empirically; even if you ignore the compute, you get better convergence from the Fisher, because with the Hessian you have to deal with the negative eigenvalues, and that's problematic. I mean, there's one hypothesis I had for a while and never tried in practice: if you take the Hessian, do an SVD, get rid of the negative eigenvalues, and recompose the Hessian out of what's left, I think that thing would work better than the Fisher, because in some sense it measures the right thing. It's not practical, but my theory is that the Hessian is only bad to the extent that it has these negative eigenvalues and you don't have an elegant way of dealing with them so that they don't harm the rest. If you fixed that, the Hessian, in some sense, computes the thing you actually care about, because it really looks at how fast your function is changing, and that's what you care about; the Fisher looks at a proxy, right? You use the KL as a proxy for how much your function is changing, instead of how much your loss is changing, and you bound that, and it's not necessarily optimal. For example, imagine you have a very high-dimensional output; say in RL you have many, many actions. Okay, I'll keep it to supervised learning, where it's more explicit. In supervised learning, what you care about is the mode of your distribution, because what matters is which class you're going to pick: the correct one or not. So what can happen is that by changing theta you change parts of your distribution where the probabilities are very small, and the KL can become very large. The KL will say: move very slowly, something is changing in your distribution. But you don't care about that part; you only care whether you're still classifying correctly or not. So you end up moving slower than you should, because of these changes in the tails that don't matter. In that sense, the KL can be way too conservative when it's trying to figure out the optimal step size, whereas if you had used the loss itself, the loss wouldn't have cared about those changes, because the loss, the negative log likelihood, only looks at the probability you assign to the correct class, I think. But yeah, that's how this stuff goes. And, in principle, the intuition is: if you want to play with the optimizer, you're always allowed to modify the preconditioner. If you multiply by something positive semi-definite, it's usually fine in practice, but you lose any kind of mathematical guarantee, because a semi-definite matrix means there are some directions you're never going to move in. So if the minimum lies along those directions, you're never going to get there, because you're not allowed to move that way. And if you have negative eigenvalues, you're going to invert the gradient in those directions, so you'll end up diverging. That's the bit you want to avoid for sure; you don't want any negative eigenvalues. Any other questions? Yes: are you saying you can't multiply by a positive semi-definite matrix?
You can, no, I’m just saying in practice from my own experience and in the homework, actually, one of the methods that I’m proposing that does exactly this. You can multiply with the semi-positive, and most of the time is fine. You actually converge to the minimums of the same quality and it doesn’t seem to be that problematic. But in theory, mathematically speaking, it shouldn’t be fine. Mathematically speaking, you lose any guarantees that you’re going to converge to a minima, if you multiply with a with a semi-positute. Um, you need to have immediately difficulties. Um, and, but i- but if you modify with something that has negative temperature, you know for sure you’re gonna divert. So at that point, like you can actually prove the opposite. So that’s definitely wrong. But the semi-positive is part of in between, right? Cool. So, um, yeah, I think that was useful because I think yesterday we kind of, well, the end of the lecture, we kind of went quickly through that. Um, One did like, okay, I’m going to very quickly go through the slides. I went there as well, just in case you tell me something. So I guess, um, There are few points that came to my mind when I was telling this. So 1st of all, There is this question of like, um, I’m excited to graduate this particular question of like, this is meant to be a conversation of the Los Angeles or real, you know, that’s right? And it looks extinly ugly. But then in a few slides, I’m going to say that actually you’re not going to behave very well. So why is the reason for that? So there’s 2 reasons. 1st of all, I would have to go back to this paper to see where this is being done, but probably is not done around zero. So usually you well around zero and behave badly far away from zero. The 2nd thing is this is a projection. and this is sort of making the trickiest part, right? So the low surface of a neural network is this one venial or more dimensional space. And this is in 3 dimensions. So, um, or 2 dimensions or whatever. So there is a lot of information that’s being lost when you go on a real dimension between 3 dimensions. And that’s why these kind of visualisations can be extremely stadium. I mean, there is a whole theory in machinery, where what we tend to do is we tend to build sort of these kind of toy models in our hands that are usually in one V or 2D. And then we apply them to models that are in extremely high dimensional places. And do you know from geometry and math, Hy spaces do not behave the same way as all dimional places. There’s a lot of weird things going on. So just something to be worried about. Like I think just generally, yeah, like I see this being done over and over again, I do it myself all the time. I used to be aware that intuitions, you know, dimensional spaces do not have to follow the high-dimensional spaces. You get you get of have. So I just kind of give you an example and maybe we’re going to talk about it if we end up doing that module. Like, uh, an interesting example is um, people build these sort of generative models, say one about images. and there was a some point a series of papers where you would take a model, you print it on C part, which is realistic images of objects and then you compute the likelihood of badness, which is features of details. And it turned out that he keeps out the images from Amnis were more likely than the original train itself. 
People were very surprised by this: that data is completely out of distribution, it has nothing to do with the training set, and somehow the model was more confident about it than about the data it was trained on. The reason for it... I don't think I fully understand the reason, but it's something that happens all the time, and I think it has to do with high-dimensional spaces. What happens in high-dimensional spaces is, as you know, that the volume concentrates on a shell. Think of a Gaussian: the mode, the mean of the Gaussian, is a sample you will essentially never draw. If you sample from a high-dimensional Gaussian, you're never going to get the mean, because most of the volume moves out to a shell as you go to high dimensions. So when you train these generative models and you sample from them, you get basically the kind of images you trained on; but the training data itself, CIFAR, is not at the mode, because you never sample from the mode. And what it turned out is that an image that was all black, no color at all, was the most likely thing under the model: the kind of thing you would never sample, yet according to the likelihood it was the most likely input. This is one of those things where low-dimensional intuition fails: if I picture a Gaussian in low dimensions and fit some data, I expect most of my data to sit right at the mode; in high dimensions it doesn't look like that at all, it's a lot weirder, right? You can see the shell effect in the little numerical check below.
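A quick toy check of that shell effect (my own illustration, nothing from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10_000):
    x = rng.standard_normal((1000, d))        # 1000 samples from N(0, I) in d dims
    norms = np.linalg.norm(x, axis=1)
    print(f"d={d:6d}  mean norm={norms.mean():8.2f}  std={norms.std():.2f}")

# The norms concentrate tightly around sqrt(d): essentially no sample lands near
# the mean (norm ~ 0), even though the density is highest exactly there.
```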
So, you know, all I’m trying to say is that, like say, if you start with a 100 per local minimum, and I increase the dimensionality, how much I want to, those 100 per local Nima will still be there. It’s just sort of very speaking, they’re going to become a very, very small fraction compared to the good local minima. So when I’m trying to do a plot like this, it’s gonna look like they’re not there. But this plot is not saying that I develop and it just disappears. that. I don’t know if it any more confusing or less confusing. But I decided to, like, I feel like, depending on how I phrased this yesterday, sometimes people have this view that as I explore the number of dimensionality, people of the minima just disappeared, and there’s no such thing as a developer minima anymore. And I don’t think that is technically true. And just to give you an example of why that is not true. Say you have a neutralisation why you have all your relatives to do that. So you have all the value for your r to be there. If I grow the size of the model, that’s not going to change. I feel like that, like that minimum will stay there forever. Yes. Is it the, um, the Los Las, is changing as I go to a higher dimension? Is that relatively compared to the total minimum, the difference between this, you know? Yeah, so the loss is changing, um, is the landscape changing, or is it relatively high dimension, and higher dimensions, then the level between the local minimums, and the global minimum becomes smaller, or they are. Not sure exactly how the answer is. So I think, like I said, as you increase, there are a dimensions, you introduce new local minimas, and those tend to be much closer to over minima. The new ones, yeah. I mean, the old ones might change as well. But I’m just sort of trying to say that there’s no sort of kind of magical impact where like a bell out of me that you have in a smaller dimension suddenly disappears or like just close very low, like some of them will just stay where they are because the way you introduce dimensions might naturally affect them. So, for example, like if I, like with that unit, example that I gave, you know, that there will always be there. Like there is nothing, you know, studying parameters and obviously somehow bring those units alive. So, so, you know, there’s some cinema that will just tell you where they are. And and sort of the whole point of this clause is not just to say that there is no, this is more like a decomaker, the probability. It just basically says the probability of a minima to have high error becomes closer and closer to zero, but you’re never going to be exactly zero. It’s just going to become, you know, very, very unlikely. I’m going to keep some of the stuff. So it was a little bit about, I mean, the professional. So we talked about sort of the sudden understanding of opportunity, underfitting, and good models. So we saw the next week in the presentation yesterday. So here, really, what we’re talking about is in a standard evaluation scheme, where you have a train on a test set, which is attached as a independent, unbiassed sample of sort of the value that you actually care about. As you play with the capacity of the mountain, you typically observe this kind of 3 stages. In the 1st stage, you’re under treating. As you increase the capacity, you get sort of where the model is pretty good, and then you start up interesting. 
And the traditional view of why this happens is: if you have enough capacity, you're going to start fitting the noise in the data. The way this is usually presented: you have some data, the data is noisy, signal plus noise. If you don't have enough capacity, the best way to fit is to fit just the underlying signal and ignore the noise, because the noise is small and jumps around. But once you have enough capacity to memorize everything, you start memorizing the training set: you keep the noise as well, and your model just exactly represents your training data. That's exactly this peak in the test error: with excess capacity, you fit the noise. That's the intuitive reason. And because of that, a big part of the field, early on, was focusing on regularization. So, regularization. You have two ways of doing things: either you directly control the capacity, for example the number of units or the number of parameters, or you have more parameters than you need and you use regularization to control the capacity; the regularization basically forces you not to use the entire capacity. The reason regularization usually works much better is that even measuring capacity is really hard. People usually use the number of parameters as a proxy for the capacity of a model, but that's not a very reliable measure, because you have structure in the architecture: putting a parameter in the top layer versus the bottom layer does not mean the same thing in terms of how expressive the model is, and all of these kinds of things. For example, if you take a particular construction for capacity, say the number of linear regions as a proxy for the expressivity of your model, and you play with those constructions, you'll see that a weight in one particular place buys you more regions than the same weight somewhere else. There are all these little details that make it really hard to have any proper measure of the capacity of a model. So it's usually easier to control things through regularization. And regularization can take many, many forms. One of them is really just restricting the number of steps you're allowed to take in the optimization: you would eventually move too far, so stopping early regularizes the model. There are penalties that say you can only represent the functions you can reach with parameters of small norm. They all control capacity by adding constraints, and you can add these constraints wherever you want. And the kind of picture I was trying to draw here is this idea that the regularizer ends up re-ranking the minima, because it's a superimposed field on top of the loss. Another way to think about it: with a regularizer, you now have an extra thing to learn, so you're eating into the capacity to satisfy the regularizer. That's a very hand-wavy way of thinking about how regularizers work as well: more pieces of information that you have to squeeze into the model.
And this slide, which, again, I'm going to keep short today as well since it's not that important, is basically just saying, and I'll just state the upshot: if you're being Bayesian when you derive your loss, you get the regularizer for free. Being properly Bayesian, the objective you get is: minimize the negative log likelihood, so minimize how badly the data is represented by the parameters you have, plus stay close to your prior. The prior is p of w, a distribution over parameters: in the Bayesian formulation you don't just have the probability of the data given the parameters, you also have the probability of the parameters themselves, and p of w is exactly the prior, your belief about what the parameters should be before you see anything. And minus log of p of w is the regularization term. This is the whole statistical view of machine learning: any standard regularizer you typically use, you can trace back to a prior. If you assume p of w is a Gaussian or a Laplace and compute the negative log of it, you get an L2 or an L1 regularizer. So almost all regularizers have a probabilistic counterpart somewhere. Given that, can I interpret the prior, or the regularizer, as prior information about how the model should behave; is that connected to why some architectures, with the right inductive bias, are less prone to overfitting even for the same number of parameters? Yeah, definitely: you can think of the prior as the inductive bias that you add. It's slightly different, at least in my view; the short answer is yes, that's exactly it. The long answer is: I think there's a difference between changing the architecture and adding a regularizer. You can think of a change in the architecture as a hard projection: this is the set of functions you're allowed to be in, and you cannot step out of it, because the architecture constrains you to those functions. The prior is more like a soft constraint: a force pushing you back toward the feasible set. So they have slightly different dynamics, I imagine, but they're meant to do the same thing; one is a hard projection and the other is a soft pull toward what you prefer. But yeah, that's right. Another way to regularize the model is through data augmentation, and this is by far one of the most used schemes, at least in computer vision; I mean, unless you have lots and lots and lots of data, and then you don't care. The idea is to take images and apply transformations that make sense, rotations, flips, crops and so forth, and then ask the model to predict the same label for both the transformed and the original image. And this, again, is just regularizing the model, because it asks it to learn more: you shouldn't just classify this image as a cat, the flipped version has to be a cat too, and the cropped one is again a cat. So the model has more to fit; it's the classic trick. A typical pipeline is sketched below.
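For concreteness, a torchvision-style sketch (the particular transforms and sizes are just an example, not from the slides):

```python
import torchvision.transforms as T

# Each epoch the model sees a differently transformed version of the same image,
# and it has to predict the same label for all of them.
augment = T.Compose([
    T.RandomResizedCrop(224),        # random crop, rescaled to a fixed size
    T.RandomHorizontalFlip(),        # a flipped cat is still a cat
    T.RandomRotation(degrees=15),    # small rotations that keep the label intact
    T.ToTensor(),
])
```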
Like you shouldn’t just classify this as a cat, but this should be classified in the California so that dig is again sort of like you it has more to that this is sort of a classicive. whatever. And then we ended up talking about charming my minimum. And here, there was another regularisation, in fact, that people have noticed, which is they started getting more and more compute. One thing that people start doing was, drove back to Vega, decent, set of days of graduation. So for a long while, we did the Castilian because it was way too expensive in computer progradium for the entire data set. But then at some point, you know, there were, I mean, if you do tell them, like, you can still find that there are these people, newspapers where people are trying to train an image that in half an hour or like 15 minutes or so forth, using lots and lots of computer. And the trick there was to try to do descent kind of come very fast, right? And when people start playing with different clean veg sizes and then throw out the computer. They suddenly noticed, but, well, they said it wasn’t working as well as a activity. There was a gapping performance. Like in terms of training error, it was fine. But in terms of violation error, the one which is very decent, we’re not performing as well as the classic ones. And this kind of led it in discussion that the optimiser itself acts as a regularizer, regularises your model in some way, and that’s why you have better accommodation. And the hypothesis of how this happens is this sort of short flight minimum. So just to repeat that iteration. So the idea is give you have some low surface. By chance, some like meals are going to be wilder and some are going to be more narrow. And now if you add noise, you are the process, the search process, your optimiser, because of the noise, you are not going to be able to compare to the energy. Because, you know, the noise is going to push you out. It’s going to make a bounce from one side, the other, and if the noise is strong enough and the minima is narrow enough, like that would mean that you jump out of it. So if you have a very flat minima, that noise is not going to really affect you. is not going to change things. So that’s kind of the inflation of the chart card anymore. And just prepared that it’s like, how would this work for validation is, like the reason why this is useful for validation, conversational flat. Minimise, because you can think of the validation laws, just sort of the same low surface, but like maybe shifted a bit or with some noise. So it has to have sort of very similar shape because these are unbiassed estimates of the same funding. So those those losses have to be somewhat for a little sense. So it does it. But, you know, there’s a, there is a more technical argument as well that goes here again into the zero description length and all of this stuff, where it says that if it’s black, then you need life to reach the enhones the data part. But Im not going to go into that by I. And so these are the invasions. However, when you look at this is not okay, what is the regular writing? How can I start a regularizer? So these are the work of David better than others. They manage to show that you can actually write analytically what is the form of that regularizer that these radiant matters have. And that regular rizer is basically just a constraint on the norm of the radiants. So you can write that this is equivalent. This is what radi said that it adds a constraint which is on the norm of the radium. 
And you can see how, by restricting the norm of the gradient, you're making the minima flatter. And the difference between gradient descent and stochastic gradient descent is: full-batch GD penalizes the norm of the expected gradient over the entire data set, while SGD penalizes the gradient on every minibatch along the path. So SGD flattens in more directions than GD does, and you get, generally speaking, wider minima. This is where we stopped yesterday. Let me just check the time; I'm going to go through this last bit, and then we'll take a ten-minute break. So the last thing on the slide was this idea of noise, and I just wanted to add to the story. The story so far has been: the noise in the optimization is helpful because it helps you find flatter minima, and you can work backwards from what the noise does and end up with these regularizers that you can understand, so you have a sense of what's going on. This kind of research led to a bunch of papers that tried to add explicit noise on top of your SGD. The idea was: well, if noise is helpful, why rely on the noise coming from SGD? Why not do full batch and then add some Gaussian noise on top, or some other kind of noise on the updates, to make sure you don't get stuck in a narrow minimum? And this would be useful: if you have lots of compute, you can parallelize with huge batches, and then you go through your data much faster. So people were trying to move in that direction. And the outcome of that research is a bit mixed, but overall it turns out that if you just manually add noise on top of the gradients, it doesn't help. This led to the observation that it's not just about noise, it's about the distribution the noise is coming from, and Gaussian is definitely not the right distribution: if you add Gaussian noise, it's not going to help you find flat minima, somehow. I can't give you a full account of what's going on, I don't think this is fully understood, but that was the overall conclusion of those papers: it's really hard to get the same kind of benefit you get from plain stochastic sampling. And here, for example, in this particular paper at the bottom, they used the noise that comes from data augmentation to see if it behaves the same as the noise that comes from randomly sub-setting your data set, and it turns out that it does not. So even something that feels quite natural, augmenting your data to create the noise, does not work as well as the noise from sub-sampling. There's something about real data that has this built in, and it's very useful. Okay, let's take a short break, five or ten minutes, and then I'll keep going.
Okay, let's continue. One thing I wanted to mention first: a lot of things in this field go like this. You'll find papers that claim X, and then a paper that claims exactly the opposite of X, and you find these contradictions all over the place. There's a good reason for that. Partially it's because the field is highly empirical: every time we claim X, the only way we back it up is by running some experiments that look like they support it, and obviously there is a huge gap between an experiment and the claim itself. And, frankly, the field is far from perfect methodologically. Most papers are really bad in the sense that you'll have a paper that states some hypothesis, say, SGD leads to sparse solutions, and sparse solutions mean you generalize better; then they run experiments and see that they get better generalization, but they never check the hypothesis itself: they never look at whether their solutions are actually sparse. This is a trend you see in lots of papers: they start with a big claim, use that claim to motivate how it should affect performance, run some large-scale experiments where the performance is indeed better, and then use that to say: this means my hypothesis is true. But it doesn't mean the hypothesis is true; the improvement could happen for many other reasons. That's just how the field grew. I mean, this happens in any field, right? We're all trying to do proper science, but proper science is hard, so a lot of the time people take shortcuts. Partially because of that, and partially because the field is empirical, you'll always get these contradictory results in the literature, and it's useful to be aware of that: don't go too crazy when some paper comes out saying X and conclude that that's how it is, because most likely in a year or two there will be a paper saying the opposite. Usually what makes this happen is the underlying assumptions people make: slight changes in the learning rate or things like that, which sometimes aren't even obvious, and it takes time for people to realize, oh, the reason this happened is that these guys did one thing and those guys did another, initialized differently, or applied different data augmentation, things like that. So, anyway: here we kept talking about how this flat-minima story is really nice. The contradictory result, and maybe it's not that contradictory at the end of the day, because it all comes down to definitions, is this paper from Laurent Dinh and co-authors that shows that flatness is not reliable as a measure. So, what does this paper actually do? The reason it's maybe not that contradictory is that it all comes down to the definition. The way the original works, and all the follow-up works, had defined flatness is by looking at the largest eigenvalue of the Hessian: if the largest eigenvalue is small, then the minimum is flat, because all the other eigenvalues are going to be even smaller, so everything is flat.
If the largest eigenvalue is very high, the minimum is narrow, because there is at least one direction in which the curvature changes really fast. This is the typical definition of flatness people have used, partly to make the notion mathematical; it also matches the intuition. And it's a nice definition because you can compute the largest eigenvalue: you just run power iteration and you get it, so it's also empirically easy to measure. But here's what happens in this work with Laurent. You can take a minimum, say someone gives you a minimum of your function, of your neural network, and there's a reparametrization you can do, one we've encountered before: you multiply some weights by alpha and the weights of the layer on top by one over alpha. We know that, for a ReLU model, this is fine: when you do this reparametrization, you get back exactly the same function, over the entire domain. But if you compute the spectrum of the Hessian at the reparametrized minimum, the largest eigenvalue is different. In particular, you can make the largest eigenvalue as big as you want, or as small as you want, by choosing alpha accordingly. So what this paper is saying is: if I have this minimum, I can make it as flat as I want, but my function is the same, so it doesn't generalize any better. Basically, the claim is that there is no causal link from that kind of flatness to generalization: look, I can make this as flat as I want, and the generalization ability doesn't change at all, because the function stays the same. I just thought I'd mention it, because it's an interesting result. Yes? When you say the function is the same, do you mean with respect to the input or with respect to the parameters? With respect to the input. The parameters have changed, but as an input-output mapping, it's the same function. So why is this happening? What's happening when you do this trick, and actually I don't remember if there are any bounds on how much you can do this, is that you're taking magnitude away from the largest eigenvalue and distributing it over the other eigenvalues. You're messing with the spectrum of your Hessian: you can somehow move curvature from one direction to another, in a very weird way. Anyway, that's what's happening. So there are two things to say. First of all, this is a very nice construction, but SGD doesn't seem to be affected by it. What I mean by that is: the direction in parameter space that this reparametrization exploits, the one that lets you blow up the eigenvalues at a given minimum, SGD just does not seem to move along that direction, somehow. So, basically, what I'm saying is that if you start from the same kind of initialization and you run two runs of SGD, and then you look at the largest eigenvalue of one run and of the other, there is a strong correlation between those numbers and how the two runs generalize, even though, by this construction, technically there shouldn't be. A tiny numeric check of the construction follows below.
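Here is that check as a small NumPy sketch (the shapes and the alpha are invented for illustration). ReLU is positively homogeneous, relu(alpha * z) = alpha * relu(z) for alpha > 0, so scaling one layer by alpha and the next by one over alpha leaves the function untouched, while derivatives with respect to the first layer get rescaled:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 32))     # first-layer weights
W2 = rng.standard_normal((10, 64))     # second-layer weights
relu = lambda z: np.maximum(z, 0.0)

def f(x, w1, w2):
    return w2 @ relu(w1 @ x)

x = rng.standard_normal(32)
alpha = 100.0
same = np.allclose(f(x, W1, W2), f(x, alpha * W1, W2 / alpha))
print(same)   # True: identical input-output mapping, wildly different weight scales
# Gradients w.r.t. the first layer scale by alpha (and its curvature by alpha^2),
# so the Hessian spectrum at "the same" minimum can be pushed around arbitrarily.
```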
So there's this tension. And the other thing is, people started thinking: okay, maybe it's all down to the definition, so how can you fix the definition? One choice: maybe it's not the largest eigenvalue, it's the condition number. Everyone knows what the condition number is: the ratio between the largest eigenvalue and the smallest one. So at some point people said, no, it's the condition number; you estimate both the largest and the smallest eigenvalue, and then the construction doesn't work anymore. That doesn't do the trick either, and the reason is another pathology of neural networks: it turns out that your smallest-in-magnitude eigenvalue is always going to be zero, so that ratio never works. Which also leads to a lot of issues, technically, but that's a different story. Then people started dividing by the trace; the trace is just the sum of all the eigenvalues. So they said: flatness is the largest eigenvalue divided by the sum of all the eigenvalues. This is somewhat more robust, but you can still get those numbers to change by playing with the reparametrization. Another solution I've seen in papers is to say: if you are properly Bayesian and you have a prior, that fixes the problem, in the sense that you don't care only about the output of your function, you also care about how likely those parameters are under the prior. That basically says that these other models you get by playing with alpha are not valid, because they're not likely under the prior; the prior eliminates these spurious models that compute the same function. And maybe the first point to make is that in practice, even with this counterexample, and this paper is from 2017 or 2018 or so, it's been around for a while, people still use the largest eigenvalue as a metric for flatness, for defining and creating algorithms and whatnot, and usually it kind of works. An example of that is SAM. SAM was proposed well after; I don't have the year here, but if Laurent's paper was around 2018, this is 2021 or 2022, a few years after. So people kind of ignored this pathology, that flatness is hard to define, and just used it, and SAM is an algorithm that was relatively popular at some point; I think it's kind of dying out now. And it typically does get you to better minima, meaning they generalize better. The whole idea of SAM, and it's a little like Nesterov momentum, though we didn't talk about Nesterov momentum, is: you compute the gradient, then you take a step in the ascent direction. Normally you do minus the gradient; now you do plus. So you take a step in the ascent direction and land over here; you compute the gradient again at that point; then you go back to where you were and you apply this new gradient. So you take a half step up, re-evaluate your gradient, then go back and apply that gradient. That is mechanically how it works; a minimal sketch of the update follows below.
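This is a PyTorch-style sketch of that two-step update, my own illustration rather than the official implementation; model, loss_fn, and base_opt are assumed to be a standard module, loss, and optimizer, and every parameter is assumed to receive a gradient:

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    """One sharpness-aware update: ascend within a rho-ball, take the gradient
    there, then apply that gradient from the original weights."""
    x, y = batch
    loss_fn(model(x), y).backward()                    # gradient at the current weights
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / (norm + 1e-12))           # the ascent ("plus") half-step
    model.zero_grad()
    loss_fn(model(x), y).backward()                    # gradient at the perturbed point
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / (norm + 1e-12))           # go back to where you were
    base_opt.step()                                    # descend using the new gradient
    model.zero_grad()
```

Note that each update costs two forward-backward passes, which is exactly what motivates the penalty shortcut discussed below.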
So therefore basically what you what you’re saying here that youre emphasiz in this component that is saying move away from high target by 1st going where the highpermaturity is and taking the step away from that. So that their argument is that by doing this process, you find mini manager are flat right, because you’re always not only try to minimise the loss, but you’ll always try to move away from where hyperverture is. Because, like, this this is kind of void into the process. Um, and there is a map, uh, which is, uh, there’s other map. but we don’t really need to go. But the argument here is that you can interpret what this is going, what this is doing. so this is maybe just to get the intusion. There is a way mathematically to show that this, like, page step thing, is the equivalent of solving this mean max problem. So you’re basically saying is minimise data. If I took a step that maximises my loss. I mean, it’s kind of obvious. And the point of all of this, I think it’s kind of nice. Yeah, there is a bunch of math. But the whole of all this point is that you can formulate it like this and you can start doing data extractions like people usually do. And then you start dropping terms, so there’s even a comment here whether what they do is they drop this turn because it looks wild. So, you know, you start throwing turns and you start taking approximations. And you show that this is an approximation, it’s exactly the loss plus a turn on the norm of the radium. Um, So the reason I, yeah, so okay. so this is exactly the regular right that we saw from sarcastic is So basically what all of this math is trying to say, this is a couple of paper on some, a lot of this math trying to say, is that actually you can show that Sam is like HGD, plus an extra emphasis that you can control now because the cohibition of this regularizer, this role here, is something that comes from Sam, it’s a different kind of parameter. Plus, a stronger regularizer on the norm of the radius. So it has exactly the same effectightability. It’s like it’s trying to find flood minimum in the same sense. But it just sort of emphasises it more on the brain then. I don’t know if that is true. I’m trying to skip over the mind because as I said, like this is like if you ever kind of want to go deeper into this topics, you know, doing a math is useful, but I think for now, like I don’t I don’t know how useful is for everyone. And the map here is really just linear algebra and the limitation. There’s nothing that it just it just mechanics. Like once you start expanding and moving times here and there, including them together, that’s basically what it is. you are not with this. So this is sort of what So the conclusion here is One that people have been pushing this direction of we want platinum and it does seem to help, even if there’s some contradiction on how you find patterns. Um, in one particular, very successful algorithm in this space is some. And that turned out to be what it turns out to be doing that this approximation is exactly what NGD is doing compared to radio just more so. emphasising the regularier. And that regularizzer just a regularizer on the square no. There is a detail. I think, yeah, it is the paper, if I can do it. which is also kind of interesting. So younger fans and cohors are the guys who did these valvations. They ended up saying like, look, that in the end is exactly just this thing, right? It’s just the loss plus plus an explicit term on the nan of the vari. Why don’t we just write it like this? 
Why do we go through all of this dance: compute the gradient, take the ascent step, go back, apply the gradient? Why don't we just change the objective to be the loss plus this penalty and train on that? And they tried that; the penalty version is the other curve here. And it turns out it worked worse. When they tried it, it did not match the original SAM. And they were like, well, what's going on? We've done the math, it's the same thing, but when we apply it, it's not the same at all. The conclusion, and I actually thought this part was kind of nice, is that the issue comes from the approximations you make when you do the derivation. The observation they make is that there are higher-order terms being ignored, and those higher-order terms refer to the curvature of your loss. Their point was that when you have ReLU activations, the curvature of the loss surface is really hard to capture. That's what this plot is trying to show: it's a loss surface generated by a ReLU model, and the idea is that if you try to look at the curvature, there is almost nothing there, because almost everywhere the surface is piecewise flat, and all of the curvature sits at these boundaries. It does show up somehow in the higher-order terms, but because the curvature is so sparse and spiky, the explicit penalty form is not able to exploit it, because in the penalty form you end up using something like a Gauss-Newton approximation, and that approximation is not able to capture those kinks. The argument in the paper is that if you switch the activation function, say to GELU, then the penalty form and SAM do the same thing. And their point was that maybe we should stop using ReLU: even though ReLU and GELU give you the same results on a lot of benchmarks, you should use GELU just because it makes the higher-order behaviour easier to capture, because it is smooth. Anyway, this is more of an aside, but it was a big thing they were pushing for (these authors are behind a lot of this kind of work on smooth activations), and I'm bringing it up because we had this earlier discussion that activation functions don't seem to matter much in this regime. This is one argument where they say: they don't matter for accuracy because we've somehow overfitted everything to ReLU for the architectures we have, but if you switch to smoother activation functions, you can do things you couldn't do before, for example replacing the SAM optimiser by a simple penalty. Anyway, I think it was a good observation they were trying to push. Any questions before I move on? Yes. The question is whether they have an explanation for why the smooth activation makes this work. Yes: the point was that if you have a smoother activation function, your approximation is more reliable, because the second-order information is actually captured by this kind of approximation, the Gauss-Newton one and so forth. So their whole point is that when you do these approximations of the curvature, if the curvature has ReLU-style kinks, the approximation breaks.
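Referring back to the explicit-penalty alternative just mentioned: a minimal sketch of how one might implement "loss plus a penalty on the squared gradient norm" with double backpropagation. model, loss_fn and lam are stand-ins; note it still needs an extra differentiation pass, which is roughly why the two forms end up comparably expensive:

```python
import torch

def penalized_loss(model, loss_fn, data, target, lam=0.05):
    """Loss plus an explicit penalty on the squared gradient norm (sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(data), target)

    # create_graph=True keeps the graph so we can differentiate through the gradient norm
    grads = torch.autograd.grad(loss, params, create_graph=True)
    grad_norm_sq = sum(g.pow(2).sum() for g in grads)
    return loss + lam * grad_norm_sq

# usage: total = penalized_loss(model, loss_fn, x, y); total.backward(); optimizer.step()
```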
So basically you're saying it's better to have activation functions whose higher-order derivatives are well behaved, so that the approximation is better. Yes, that's the heart of it. And the issue is really with the approximations we make: you make a bunch of approximations, and those approximations become more or less reliable depending on how smooth your function is. That's the argument they're trying to make. The typical approximations we end up doing, when you do these expansions and replace the Hessian with something like a Gauss-Newton approximation, are more reliable if the function is smooth; if the function is not smooth, the approximation becomes a lot less reliable. And to support this, they have these ReLU versus GELU comparisons, and they have a beta parameter where the beta controls how smooth the transition is, and they show... you can check the paper for the details, but basically the main point of the paper is: it might not be visible when you're just comparing benchmark performance, which is what people had done when comparing activation functions (take your model, replace the activation function, retune the hyperparameters, see how everything behaves; you don't see any difference), but it does make a lot of difference when you're trying to regularise the model, or probe whether the model is smooth or not. That's their argument. And they make some weaker arguments of the form that this is potentially why second-order methods have not been that popular for these architectures: because you want a well-behaved second derivative, and if the activation does not give you a well-behaved second derivative, then all of that machinery struggles. So they're arguing we should focus on functions that are smooth, differentiable everywhere. Then the question from the back: doesn't this contradict what we said in a previous class, that the choice of activation doesn't matter? So, it matters, but not in the way you might think. The reason people say it doesn't matter is that if you look at the plot, which is just benchmark accuracy, both activation functions give you the same numbers. If I'm only looking at my validation set and asking what accuracy I get, it doesn't matter whether I'm using GELU or ReLU; I'm getting 78% in the best case in both scenarios. They're comparable in terms of raw performance. Where it matters is that if I have GELU, which is smooth, then for example I can simplify things: I can change my optimiser to use this penalty form, which is potentially cheaper, or at least a different formulation. So they're saying it matters for other things, not just for the headline number; if you're only looking at the accuracy you get on the benchmark, it doesn't matter.
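The beta-controlled smoothness they refer to could be illustrated with something like a softplus that approaches ReLU as beta grows; this is my own illustration of the idea, not necessarily the parameterisation used in the paper:

```python
import torch

def soft_relu(x, beta=1.0):
    # softplus_beta(x) = (1/beta) * log(1 + exp(beta * x));
    # smooth for small beta, approaches the ReLU kink at 0 as beta -> infinity
    return torch.nn.functional.softplus(x, beta=beta)

x = torch.linspace(-2, 2, 5, requires_grad=True)
for beta in [1.0, 5.0, 50.0]:
    y = soft_relu(x, beta).sum()
    (g,) = torch.autograd.grad(y, x, create_graph=True)
    (h,) = torch.autograd.grad(g.sum(), x)   # elementwise second derivative
    print(beta, h)  # curvature concentrates near 0 and vanishes elsewhere as beta grows
```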
But if you have these smooth functions, they're arguing, then second-order methods might work much better, or different changes to the optimiser might work much better. So you're basically opening the door to all sorts of other things you could do. I don't think this particular paper is a game changer, because honestly, whether you use this penalty form or the original form of SAM, it's equally expensive. If you're the kind of person who just looks at FLOPs and iteration counts, you don't gain anything by switching from one formulation to the other; they're roughly equivalent in computational cost. But the broader point is that there might be other things that are not the same: other formulations, tricks, penalties that are easier to compute, that would work if you have a smooth activation function but would not work otherwise. The other side of their argument, and it's a general argument you will see made many times because it is true, is that the field overall is obsessed with benchmarking, and by paying attention only to performance on the benchmark, we lose track of other properties that you might want from the system. Activation functions, for example, matter for these other properties, even if for the benchmarks we have, and the way we use them, they did not matter. Does that make it slightly clearer? Let me check the time. So, there is another topic: I think there was a question yesterday about double descent, and now we're finally getting to double descent. We talked a lot about regularisation and we said it takes a lot of different forms. But there was still one phenomenon that looked absurd, which is double descent, and it will turn out to be regularisation doing its job. This is more or less historically how things happened. This is one of the first papers pointing in this direction: the paper arguing that classical generalisation theory wasn't really explaining deep learning anymore (the "rethinking generalization" paper, I believe). What they did is they took ImageNet and a typical model of that time that was doing very well: if you train the model on ImageNet, you get a very good solution, and the training loss goes essentially to zero. Then you shuffle the labels, that is, you replace the labels with random labels, you train on that, and you look at the training error. And then you also shuffle the pixels: first you randomise the labels, then you also replace the images with random pixels, so now you're basically just learning noise; everything is noise, and you try to train on that, and you can still drive the training error to zero. And the point of the paper was: these models are big enough that they have the capacity to memorise this amount of random noise, the same size as the dataset. They clearly have the capacity to memorise ImageNet. But somehow, when I train on the real ImageNet, I don't overfit; I end up generalising. And we don't know why this is happening. That was the point of the paper.
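A minimal sketch of the randomisation test being described, with shuffled labels and then random pixels; the dataset, model and training loop are stand-ins:

```python
import torch
from torch.utils.data import TensorDataset

def randomize(dataset, shuffle_labels=True, random_pixels=False, num_classes=1000, seed=0):
    """Return a copy of (images, labels) with labels and/or pixels replaced by noise."""
    g = torch.Generator().manual_seed(seed)
    images, labels = dataset.tensors
    if shuffle_labels:
        labels = torch.randint(0, num_classes, labels.shape, generator=g)
    if random_pixels:
        images = torch.randn(images.shape, generator=g)   # pure noise, no structure left
    return TensorDataset(images, labels)

# Train the same model, with the same recipe, on:
#   1) the original data                                   -> ~0 train error, generalises
#   2) randomize(shuffle_labels=True)                      -> still reaches ~0 train error
#   3) randomize(shuffle_labels=True, random_pixels=True)  -> still reaches ~0 train error,
#      just needs many more updates; validation stays at chance.
```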
And it is a weird observation: all of these things about capacity, about how too much capacity means you overfit, somehow don't apply, because this model definitely has more than enough capacity. And there was another piece of information in that paper which I don't think the authors emphasised as much: if you look at the number of steps it takes to fit the original ImageNet data versus the number of steps it takes to fit the randomised version, it turns out you learn the data that generalises much faster than you memorise. It takes a lot more epochs, a lot more updates, to memorise, but you can do it. So this plot is just showing how many updates you need to converge, and it says: if the data has structure, if I'm just learning ImageNet, I need this many updates; if I replace the images with random noise, I can still get to zero training error, but I need a lot more updates to get there. That's the point. And this, in my head, is also the basis of double descent: the idea that somehow the solutions that are good are closer to you than the solutions that are bad. I think this observation has exactly that flavour, and it's the basis of double descent. Do I have a question? Yes, the question is whether this relates to the compression idea we discussed. Good point. I honestly haven't thought about it in exactly those terms, but I think it does relate; it's basically saying something about the inductive bias. Actually, I'm not sure how to frame it precisely. They're definitely related, in the sense that if you learn by memorising the training points, you have to store a lot more bits than if you learn the structure, so you're going to need a lot more updates. I think they're related; I haven't seen it spelled out, but they're quite close. Maybe I need to think more about how to frame the relation, but it's a very similar story. My only hesitation is that it compares things along one particular axis while compression is along a different axis, and I'd need to think about how to project one onto the other to make it precise. Then there was another question: if the labels and pixels are random, how can it generalise at all? It's not going to generalise. This is crucial: we're only looking at training error here. With shuffled pixels there is nothing to generalise, because there is no structure. The point of the experiment was different; let me unpack it a little. There was this whole question that we don't have a good understanding of what capacity means, and we don't know what the capacity of a neural network is. And the assumption in the community back then was that, sure, these ResNet-style architectures, or whatever models we were using on ImageNet, look big, but their effective capacity is not that big. That's why we can learn ImageNet, right?
Because the idea was that if you have too much capacity, you're going to overfit and it's not going to work. So the assumption was: it looks like there are a lot of parameters, but potentially because of the hierarchical structure, because of all these symmetries, the actual capacity the model has is probably of the right order of magnitude for ImageNet, and that's why it works well. What this paper is saying is: look, I don't care whether I'm able to generalise or not; I'm just checking whether the model can hold the same amount of information as it takes to store ImageNet, whether I can basically force that into the network, whether the network has the capacity to do that. By using random pixels with random labels, the dataset just becomes a long sequence of random numbers. There's no structure anywhere; everything is random. So the only way you can get the training error to zero is to memorise these random numbers, and there are as many random numbers as it takes to describe every example in the dataset. The point of the experiment was that, yes, you can do that: the model has enough capacity to store that many random numbers. But when it trains on the real ImageNet, where it could generalise, it chooses not to do that. It has the capacity, and it could in principle fit ImageNet by just memorising it and doing a sort of lookup whenever you query the model, and then it would not generalise at all. But it chooses not to. It chooses to find a different solution, one that generalises. And this was surprising at that point in time, because for a long time I think people assumed the model would always prefer to memorise if it can, and that the only way you get the model to do something interesting is to restrict it enough that memorisation is not an option, so that it is forced to find a different solution, one that describes the data in a more interesting way and might generalise. That was the underlying principle behind a lot of machine learning. I mean, in some ways it still is; when we talk about compression right now, it isn't exactly the same thing, but it's related. So one of the underlying premises was: obviously, if you can just memorise things, that is the easiest way to fit the training set, but it's not going to make you generalise; you need to force the model to learn something else. And the way this is usually framed is through compression: you're forced to compress because you don't have the capacity to store the whole thing, so you have to compress, and that's when you start looking at how examples relate to each other and how they can be squeezed into something smaller. So the intuition back then was: if there is no need to compress, you're not going to generalise, because the learning process will just store everything and that's it; that's a perfect solution for the training set. You need to do something to force the model to compress the data it sees. And here the point is: well, compression is still happening, because this model does generalise, but we don't know why, because there was no need; the model had the capacity to just store everything, and learning would have been easier if it had, but that's not what it did.
That's the story they were trying to push. So now you're asking: when the model is overparameterized, are there always multiple solutions for the same dataset? Yes, and this is going to be the perspective of double descent: there are multiple solutions when it's overparameterized. One of the solutions is obviously just to memorise the data and build this lookup table. And the question is: why is it not picking that solution? Why is it picking a different solution, one that compresses the data and does all of these interesting things? That was the open question this paper was posing. Next question: can this be related to the concept of grokking, where the model first essentially memorises and only later learns the general solution? It can be connected to that; I can try to make the parallel once I've gone through the rest of double descent, but it's not directly that, there is another layer on top to get to grokking and so on. But yes, there is a relationship; I can talk about it. Then a question about the plot: when it's generalising, doesn't running for more steps mean you overfit? So you're asking whether you need more steps to overfit, or am I misunderstanding? OK, about the training loss: in the paper, and we can open the paper at some point to check, if I remember correctly they report that the training loss is the same. So here, the model trained on the original ImageNet and the model trained on the random pixels: these are not the same training run, and it's not one run continued for longer. Maybe this plot on the right is a bit confusing. All I wanted to say is that you have two independent training runs that have nothing to do with each other. One trains on the original ImageNet; the other trains on the randomised data. They both get to zero training error; well, there might be some subtleties, because I don't know exactly how close the losses are, but if I remember correctly the paper claims they get the same loss, which is very close to zero. So from a training perspective they're the same. All the plot on the right is saying is that if you count how many updates each of these independent runs did, the one on random pixels did a lot more updates to reach the same training loss as the other one. But they end at the same training loss; there's no difference there. And to be clear, this is training loss, not validation; we're not looking at validation at all, everything here is about training. If you look at validation, the random-pixels run is at chance, because there's no structure, there's nothing there to generalise. Last question: what happens if you keep training the real-data run for as long as the random one, will it overfit? Good question; I have not run that experiment. My guess is that yes, eventually you'd start to overfit. And maybe that is a good segue to what I want to show next.
So I have another picture that might be better. There was this follow-up paper that came a few years later, by Belkin and colleagues. It says: look, besides the usual textbook picture, the classic overfitting curve where, as you increase capacity, training error keeps going down but test error starts going up after a while, there is this special point, and if you keep going after it, if you keep growing the model size, if you keep adding capacity, you see something strange: the test error starts going down again. And it keeps going down; it never goes up again, you just keep getting better the more parameters you add. And this, he says, is the magic of deep learning. This is why people have kept making bigger and bigger models and never seen any penalty for it: it looks like the bigger you make the model, the better it gets. The only thing is you have to get past this threshold. Up to the threshold the picture looks exactly like the one from traditional machine learning, and after that point it doesn't any more. So this is the basic picture of double descent. If you've never seen it before, I'm going to try to give you the intuition, or at least one version of how you should think about it. What's happening here is that before this threshold you typically have at most one global minimum that is representable by the model; your model is restricted enough that there is at most one. After the threshold you start getting more and more global minima, because you're overparameterized: there are many, many choices of parameters that give you the same global minimum of the loss. And in general the number of critical points starts exploding. Then the argument Belkin makes, and I don't think I have the exact plot here, is roughly this. Before the threshold, in the traditional setup, you have that one minimum and that's it. His point is that as you increase capacity, there are more and more minima appearing. And the second argument he makes is that gradient descent, or stochastic gradient descent, which is the optimiser we're using here, will converge to a nearby minimum. So his point is: if I keep increasing the model, I'm going to have more and more minima; in expectation, minima are going to get closer and closer to where I am, because there are simply more of them spread around, so naturally some of them are close to my initialisation. And because I'm doing SGD, I'm going to converge to one of the close ones. And because I initialise close to zero, that means I'm going to converge to solutions with smaller and smaller norm, since they're close to the initialisation. And solutions with small norm generalise better: small norm is a sort of Occam's razor, a simpler explanation of the data, more compressible, however you want to frame those arguments. I know this is not the most intuitive thing, but it's one of those big shifts that happened at some point in the field, and I think it's an interesting story.
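A small self-contained illustration of the shape being described, using random-feature regression where the minimum-norm least-squares solution stands in for "SGD converges to a nearby, small-norm solution". This is a generic toy, not the experiment from the paper; the sizes and noise level are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 15
X, Xt = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.3 * rng.normal(size=n_train)
yt = Xt @ w_true

def random_features(A, W):
    return np.maximum(A @ W, 0.0)          # fixed random ReLU features

for n_feat in [5, 10, 20, 40, 80, 160, 640]:   # ~40 features = interpolation threshold here
    W = rng.normal(size=(d, n_feat)) / np.sqrt(d)
    Phi, Phit = random_features(X, W), random_features(Xt, W)
    # pinv gives the minimum-norm interpolating solution once past the threshold
    beta = np.linalg.pinv(Phi) @ y
    test_mse = np.mean((Phit @ beta - yt) ** 2)
    print(f"{n_feat:4d} features  test MSE {test_mse:8.3f}")
# Typically: test error improves, peaks around n_feat ~ n_train, then descends again.
```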
Question: the capacity being increased on the x-axis, is that the number of neurons? Yes, exactly. And is it the number of neurons or the number of parameters, and do you grow in width or in depth, or in which direction? So, this is a very generic statement, and that's my one quibble with it: it doesn't say anything about how you increase the capacity. Actually it doesn't even say anything about adding neurons or parameters. What it says is: there is some notion of capacity that you are increasing in your model. We don't know how to compute it, but we know that if you add parameters you will change it. And as I said before, if you add parameters in this layer versus that layer, you change the capacity by different amounts: adding them in some places increases the capacity a lot, adding them somewhere else might not increase it at all, and so on. So he's saying: if you have a mechanism for increasing this quantity that we don't know how to measure, then this is what will happen. But how do we actually increase the capacity of the model? That's completely outside the scope of the paper; it doesn't answer it. There is some math, and he does something for linear models where increasing capacity is easy to define, essentially one particular way to increase the capacity, but it doesn't answer the general question. And I think it is an interesting question: how does the capacity of the model change depending on where I put my parameters? I can easily imagine scenarios where there are places I can add parameters and it doesn't change the capacity at all, because of the structure of the model. You even have bottleneck effects: if you have a bottleneck layer and you keep adding parameters above it, you're not actually making the function class more expressive; you're stuck, no matter how many parameters you add. So he's not answering that question. Another question: can't you define capacity using the number of linear regions the network induces? Yes, that's one way, and it is one of the more traditional ways of measuring capacity for piecewise-linear networks: you can think of capacity as the number of linear regions. But even with that, the question of how many more regions you can maximally get by adding a parameter or a neuron here versus there is not easy to answer, and the answer matters. Adding a unit in one place can buy you many more regions than adding it next to a bottleneck, and so on; in general we know it's not uniform. There are recipes for how to make a model bigger that people typically use; those are purely empirical, there is no theory behind them. Usually there is a ratio between width and depth: once your width has increased enough, you add some layers, and so on. And people almost always use powers of two for the sizes, mostly because people like powers of two; you go from 128 to 256, things like that. There used to be a reason back in the day, something about the GPU and how the hardware is utilised: with powers of two you didn't have cores that were wasted, because things tiled correctly. I don't think that argument really holds any more; I don't think it's even an issue now.
Can we say that nowadays generalisation depends more on the optimiser than on the model capacity? Definitely, in some sense. Let me rephrase it: if you do not use gradient descent, if you use something that is not gradient-based, you're probably not going to see this double descent, right? Because part of the argument is precisely that gradient descent likes to converge to a nearby minimum. If I have a different kind of optimiser that doesn't have this bias of converging to something close by, then you're not going to see this; you might see some other kind of behaviour, but not this one, not for the reason Belkin suggests here, which is that you converge to something close to you, and because you're close to zero, the norm is going to be small. So I basically agree with that statement: the optimiser is a crucial part of all of this. If you remove the optimiser, you can't make any of the arguments being made here; he is relying on the fact that you're using gradient descent. Whether it has to be gradient-based exactly, I think the argument generalises to most similar methods, but if you use random search, for example, not so much. Random search could be an optimiser: you keep picking random parameters and keep the best according to some performance measure. Not a good one, but it is an optimiser, and with that you would not get this picture; you'd get something else. Or exhaustive search: if I enumerate everything exhaustively and infinitely, everything changes past the threshold too, but for different reasons. Sure, that's another good example; you wouldn't get the same picture, because with exhaustive or random search you don't have this bias towards nearby solutions, so this particular phenomenon goes away. Let me check the time. OK, I still have a couple of minutes. So this plot is, again, trying to suggest the idea that as you increase capacity, more minima appear, and therefore you start converging to things that are closer and closer to you. At least that's what I was trying to convey when I drew it; I don't know if it comes across. There is an additional important ingredient, which I don't think is emphasised in the papers: the argument relies not only on the optimiser but also on the initialisation, on the fact that you start close to zero. If I were to start somewhere really far away, I would not get small-norm solutions, because I'd converge somewhere close to where I started. And I think this matters because nowadays we often don't start from scratch. Nowadays you take pretrained models and you fine-tune them. In a lot of modern setups you don't train from scratch; you start from some checkpoint someone gives you, over which you have no control, so you can't control the initialisation. It doesn't seem to hurt us in practice, but I'm just putting it out there: because you don't control the initialisation, this argument doesn't automatically go through, because if you start in the wrong part of the space, the norm is not magically going to be small for you. I guess the saving grace is that if you have a really good pretrained model, pretraining itself ended up somewhere close to where it started, and then you fine-tune from there, so you're still fairly close to zero; but the more stages you stack, the further away you drift from zero.
So depending on how you do this, the argument can stop applying. I just wanted to point out that there is an extra component here, the initialisation, and it also plays a role in this argument, because the argument relies on the initialisation being close to zero, and therefore on converging to something close to zero. And actually, I don't have the slide (I don't know why, I used to have it), but as I mentioned before, there are reparameterizations you can do where, no matter how big your model is, if you fine-tune it you get zero training error but only chance-level validation. The way you typically do that is you initialise the output layer to be a couple of orders of magnitude larger than the rest; that's all it takes. You move away from the usual initialisation, where the norms of the layers are balanced, and you just multiply the output layer by a thousand. Then when you fine-tune, you're going to get zero training error no matter how big you make the model. I mean, things probably change if the model is genuinely infinite, but for anything finite, however big, this happens: when you train it you get zero training error, and the test error stays flat; it never comes down. Question: what about large-scale pretraining, for example if I want to train a ViT starting from a pretrained ViT checkpoint, is that worse than a standard initialisation? No, definitely not; we don't see that, and I'm pretty sure it just works. I'm only saying that the initialisation plays a role in this argument, in this whole story. Building on that: we have this intuition that pretraining helps because it puts you in a good place, and in practice you've never seen it be a problem. I think the reason it's not a problem is that you start at zero, you pretrain on ImageNet, you end up somewhere close to zero; then when you fine-tune, you're still staying close to zero. So there is some region around zero that you somehow never leave, because the pretraining on ImageNet lands you around there, for exactly this reason: the model is very big, so you converge somewhere close to where you started. Then the follow-up: but what if my model is not that big, what if it's a relatively small model and it hasn't reached essentially zero training loss? So the argument I made before applies: the models in these plots that we call big, the ResNets, the VGGs, are not actually that big by modern standards, and they're still in this regime. The point is that even the models we think of as small are crazy expressive. I've never seen anyone do exactly this experiment, but an interesting question is to take a genuinely small MLP, say a hundred units per layer, and try to memorise random noise of the size of ImageNet; I would be surprised if you got very far. So even the models we nowadays call small are extremely expressive. But that's a fair question you're asking.
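Going back to the last-layer rescaling trick mentioned above, a tiny sketch of what it looks like; the factor of 1000 follows the description above, the architecture and sizes are arbitrary stand-ins:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),            # output layer
)

with torch.no_grad():
    # break the usual balance between layer norms: blow up the last layer's init
    model[-1].weight.mul_(1000.0)
    model[-1].bias.mul_(1000.0)

# Per the claim above, training this model normally would still reach ~0 training error
# while test accuracy stays near chance, because the solution found lies far from the
# small-norm region the double-descent argument relies on.
```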
So my answer at the moment would be that, for whatever model sizes we actually deploy, we are still in this modern interpolation regime. My argument is that it's really hard to get into the classical regime: your model needs to be really tiny to be in that picture. And that's why people did not naturally observe this; it took a long time for Belkin's picture to emerge, and I'm not even sure how cleanly the full curve has been demonstrated on large modern networks. I think there is a paper on this, but the basic point is: the typical choices we make already put us in this modern, overparameterized regime. That's why this wasn't obvious; it wasn't a day-to-day thing where you train your model and see this curve all the time. And it's also most likely that in the example you gave, the model is still in that regime. But it's a good question. We don't know how to measure capacity properly; we don't really know when we're in the underparameterized versus the overparameterized regime. I would say that most of the things that exist out there and get used are in the modern regime; I would be surprised if that were not the case. OK, I think this is a good time to take a short break, and then I'll continue. So, the way learning works, roughly, is that first you have the architecture choice. This is where the function family comes in: you choose your activation functions, you choose your network, and maybe some hyperparameters, like how many layers, how wide, whether a layer is fully connected or a convolution, and so on. That defines the family of functions we have access to. In this picture, this big blob, which I call H, sits inside function space, and it is all the functions you could express using that particular choice of architecture and those particular hyperparameters. But what people don't talk about as much is that after you do that, you choose a starting point, and this is where initialisation comes in, the setting of theta zero. And from theta zero, you have what I call the reachable set. Just because you can express a particular behaviour with your choice of architecture doesn't mean you can discover it with your optimiser. Your optimiser, and this is partly repeating what we discussed so far, acts itself as a form of regulariser, and it restricts the search space. There are only certain functions you can reach with your optimiser, because your optimiser has its own inductive biases. And it's not just the optimiser; I have this box on the slide to remind myself: it's not just the optimiser that decides what the reachable set is. It's the learning rate, it's the data, it's things like the number of update steps you do, whether you use momentum in your optimiser or not. All of these things restrict your model in some way. And I find this picture useful; I've been asked about this in the past, and this is how I reason about it.
So, I don't know if you've noticed this in the outside world, but at some point there was this big push: machine learning has generally been an algorithm- and architecture-centric field, with people constantly coming up with new architectures and so forth. But now people argue that we've moved into a data-centric view, where a lot of people say the Transformer is the architecture, there's no point in inventing new ones, and the only thing that matters now is the data. There have been workshops around this, conferences, position papers and perspectives, and there is a lot of value to it. It turns out that if you take one of these big LLMs and it doesn't do something well, say it's not very good at coding, the most effective way to make it better at coding is to change the training data, more than anything else. That has a bigger impact on how the model behaves than anything else you could touch. And one question is why that is. The reason is that for the family of models we're using now, at least at the scale we're using them, this set H is enormous, and what limits us much more is the reachable set. In the past, the reachable set was almost the same as H, and by changing H you had a much bigger impact on the outcome of the whole process. But now the reachable set is much, much smaller than H, and most of the time when we hit a barrier, when the model is not performing as well as we'd like, it's because of the reachable set. And as I said, that is partly the optimiser and partly the data, so by playing with the data we're basically changing the shape of the reachable set. That said, and maybe this is a more pedantic point: there are things the Transformer, or any parametric model we use, cannot represent. There are things that exist outside of H. Maybe we don't care about them, maybe we do, it depends on the concrete example and how we build things, but there are things outside of H, and those you can never reach just by changing the data. The whole point is that you can only reach things inside H, and the data controls how far inside H you can move. So I'm trying to make this perspective clear, and where everything sits. Question: why are there two different reachable sets inside the big H in the figure? Because of theta zero. I was trying to make the point that the initialisation also controls which functions you can reach: from this initialisation you can only reach this region, and if you pick a different initialisation point, you don't get the same reachable set. They all have to be within H; you cannot move outside of H, but you will always be within it. And initialisation, data, optimiser, each of these takes a bite out of it. Question: what exactly do I mean by the reachable set? It's the set of functions you can achieve by running the optimisation process. Think of the optimisation process as a stochastic process, a process where there is noise.
So multiple runs of the same process might give different solutions; there is going to be a whole set of solutions you can reach. And depending on what you marginalise over in this box, you get a different set. For example, you could fix everything except the data and ask, in expectation over possible datasets, what my optimiser can reach; that's a different, probably bigger set, but it's still not going to be the same as H. Or you can fix the data and take an expectation over possible optimisation choices: SGD or second-order methods, which learning rate, which schedule, and so on. Again you get a different set, which tells you: with this data, these are all the functions you could possibly learn. And that set, again, is smaller than all the functions you can express, because to reach some functions you would need data providing certain evidence. It's more of an illustrative device, but I find it a useful mental model. So the reachable set depends on the architecture and on all of these other things. Question: in the context of Transformers, can you give an example of something in H that is not reachable? It's hard to give an exact example, but let me try. For example, we talked a little bit about this (and I won't go through it again, and I may phrase it slightly loosely): if you increase the context length, the Transformer breaks down because the attention becomes uniform and stops helping. So in some sense there is a limitation of H, a limitation of the architecture: Transformers cannot work with infinitely long sequences, by construction. It's not a practical concern, you'll never have an infinite sequence, but from a purely mathematical point of view, if you try to apply a Transformer with global attention (so a softmax over everything) to sequences of infinite length, the softmax attention does not work, and the Transformer fails as an architecture. So trying to learn anything on that kind of family of problems is outside of H. That would be an architecture example. For the reachable set, it depends on whether you focus on the data or on the optimiser. On the data side: if you don't have examples of adding two numbers in the training data, then the Transformer, even though it can technically represent a function that adds two numbers within some domain, will not learn how to add numbers; that needs to be somewhere in the training set. So that's an example of a class of functions inside H that you're not going to reach, simply because you don't have examples of adding numbers.
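On the earlier point about attention becoming uniform for very long contexts, a small numeric illustration: with bounded attention scores, the largest softmax weight shrinks towards 1/n as the context length n grows. The score range here is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [16, 256, 4096, 65536]:
    scores = rng.uniform(-3.0, 3.0, size=n)    # bounded attention logits (assumption)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    print(f"n={n:6d}  max weight {weights.max():.5f}  uniform would be {1.0 / n:.5f}")
# As n grows, even the largest attention weight decays roughly like O(1/n),
# so any single token's contribution is washed out in the infinite-length limit.
```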
And then on the optimiser side it can be a bit more tricky, but here is a high-level example. In this set H there are functions whose parameters are extremely large, say weight entries whose magnitudes are in the billions. SGD is not going to converge to those. You cannot converge to those, because learning becomes unstable when the weights reach that kind of magnitude; you just get divergence, it doesn't look like learning at all. Whether you care about those functions is a different story; maybe they're of no interest, maybe we don't want Transformers whose parameters and activations are in the millions, but they do exist in the space, and they are not reachable, simply because of your optimiser. You asked for an example that makes sense; whether it's one you care about is another matter, but hopefully it's good enough. Question about the figure: how did I decide the relative sizes of H, the reachable sets, and the initialisation? So: the reachable set has to be inside H, and the initialisation has to be inside H as well; that's the only constraint, one inside the other. Don't read too much into the exact sizes, I just drew it that way; the two reachable sets from the two initialisations having different sizes carries no deeper meaning. And the same kind of reasoning applies to things like data augmentation: it's one more thing that shapes the reachable set. Now, one particular point I wanted to dwell on, because I think it is the least orthodox part of what I'm saying (maybe to you it won't be surprising, but from the community's perspective it's a little less standard), is that the optimiser itself can shape which functions you reach. I say this is less orthodox because typically our knowledge of optimisation in machine learning, or the way we think about it, is mostly borrowed from the optimisation community, which historically worked on convex problems and conditioning. That's where most of the theory comes from: in convex optimisation you can actually prove things. And in convex optimisation you have this special situation where there is only one minimum, and the only thing you care about is how quickly you get there. So if you look at typical optimisation theory in machine learning, it mostly talks about rates of convergence and that kind of thing. It feels like everyone cares about how quickly they will converge, but not that much about using the optimiser to decide which minimum they will converge to. And what I'm arguing, from this picture, at least conceptually, is that the optimiser can play an equally important role in deciding which kind of minimum you end up in. We've already seen that: for example SGD versus GD. SGD, because of the noise, will only converge to minima that are flat enough relative to the noise. So that's a qualitative difference in the type of minima you converge to: just by picking the optimiser, you've already picked the type of minima you're likely to land in. And here I'm just building on that; I think it's on the next slide.
Now I'm going to skip this slide and go to the one I was aiming for. This is exactly what I was saying before; here I'm just making it a bit more explicit. Even if I start from the same point in the same space, my choice of optimiser, whether I use optimiser A or optimiser B, can lead me to different minima. And furthermore, it can be that those minima have qualitatively different properties. In fact, that's what you should expect most of the time; it's not just a chance thing: with high probability, if you optimise with A you land in a minimum of one type, and if you optimise with B you land in a minimum of a different type. So the idea here, let me reframe it, is that you can encode inductive biases in the optimiser. I think that's pretty cool, and I haven't seen it discussed as much in the literature, which is why I'm bringing it up. The rest is relatively standard: we know you can encode inductive biases in the structure of the model; that's what this part of the picture is about, right, by choosing this activation function, or this particular layered architecture and so on, you're adding inductive biases, you're prescribing what kind of solution you want the model to converge to. What I'm saying is that you do the same thing when you pick your optimiser; the optimiser is another such lever. Let me jump ahead to the example and then come back, because I want to show you how this could work in practice. The example is that by choosing your optimiser you can force your model to converge to sparse solutions, which I find pretty interesting, so let me walk through it. This is based on the Powerpropagation paper, which I was involved in. In that paper we did not actually frame it as an optimiser: we had this idea about changing the parameterization, and because the literature frames everything in terms of architectures, we ended up presenting it as a change to the architecture. But on the next slide I can show you that this change of architecture is really not just a change of architecture; framing it as a change to the optimiser is actually a bit more natural. What we really wanted is this: we want an algorithm such that when I take my neural network and train it, it naturally converges to solutions that are sparse, meaning a lot of the weights are zero. The question is: how can I change my optimiser so that it naturally goes to sparse solutions? First of all, it's hard to get exact sparsity directly, so you relax it a little and say: I want a big chunk of my weights to be very, very small, inactive, instead of exactly zero, and then you can sparsify afterwards yourself, by taking any small-magnitude weight and snapping it to zero. The way we did that is we said: if I look at the loss surface as a function of one particular weight of the model, I want to change that loss surface such
that there is a plateau around zero. So I want to add a saddle around zero. Why a saddle? First of all, because I want SGD to still be able to escape it. We talked about saddles: they're not like local minima, you can still escape them when there are places where the loss is even lower. But what the saddle does is make learning extremely slow while you're sitting in it. So the mechanics I'm envisioning are: if a weight is already very small, very close to zero, SGD will have a really hard time moving that weight, because it's on a plateau. So what the learning process will do is first see whether it can minimise the loss just as well by using other weights that are not stuck. If a weight gets stuck near zero, the learning process needs a really good reason to move it away from there, because it has to spend a lot of updates, a lot of effort, to climb out of the plateau. So how do you get a saddle at zero for every weight? It's a simple reparameterization: you take your parameters and, instead of using the weight w directly, you use w raised to a power. You need this particular form just to take care of the sign, but the concept is simply that you raise the weight to a power. And if you raise the weight to a power, you get exactly this flattening around zero; it's the same picture you get from a cubic, the w-cubed shape with a saddle at the origin. So what happens when I run plain gradient descent on this? As I said, if some weights are very close to zero, they stay close to zero unless the model really needs them for something else. And you can see this in the distribution of weights after training: here we use different powers, and as you increase the power, the weights concentrate more and more around zero, because a lot of them get stuck there. But there is an issue with having implemented this as a change of architecture, where the forward pass replaces the weight with the weight raised to a power: if you now train with Adam, for example, Adam counteracts this plateau very easily, because the plateau is explicitly visible to it. If you look at the second-moment statistics, they give you essentially the correct rescaling to undo the plateau, so the model no longer gets stuck; Adam estimates the local scaling and figures out the step size it needs to escape. So in the paper we had to say: you can't really use Adam for this to work. You have to use a rather awkward optimiser where you correct for some of the curvature but not for this curvature; you have to decompose the curvature into two components, one coming from the reparameterization we added and one coming from the loss and everything else, and treat them differently. I'm not going to show you how we did that; I'm just saying that because we thought of it as a change of architecture, it became a complicated mess, whereas you could do it quite differently.
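A minimal sketch of the reparameterization described above, in PyTorch. The form sign(theta) * |theta|^alpha follows the "raise the weight to a power, take care of the sign" description; the layer sizes, init scale and alpha are arbitrary stand-ins:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PowerLinear(nn.Module):
    """Linear layer whose effective weight is sign(theta) * |theta|**alpha (sketch)."""
    def __init__(self, in_features, out_features, alpha=2.0):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.alpha = alpha

    def effective_weight(self):
        # the loss, seen as a function of theta, now has a plateau/saddle around theta = 0
        return torch.sign(self.theta) * self.theta.abs().pow(self.alpha)

    def forward(self, x):
        return F.linear(x, self.effective_weight(), self.bias)

# Trained with plain SGD, small thetas barely move (their gradient is scaled by
# alpha * |theta|**(alpha - 1), which vanishes near zero), so many weights end up
# concentrated around zero and can later be snapped to exactly zero.
```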
Because if you look instead at what this reparameterization does to the gradients, you find that the gradient with respect to the underlying parameter is just the normal gradient you would have had, multiplied by (a constant times) the absolute value of theta raised to the power alpha minus one. I'm not going through the derivation because it's not important; what I'm trying to say is that what we were effectively doing in that paper is using a special preconditioner, and that preconditioner is essentially |theta|, which makes no sense from a classical optimisation point of view. You're saying: my preconditioner is the absolute value of the parameter itself. It's a diagonal preconditioner, because for each weight, each parameter, the preconditioning factor is that parameter's own magnitude, and since it's an absolute value it is positive semi-definite. Obviously it's problematic if theta is exactly zero, because then it is only semi-definite and you can get issues, but you can handle that. So why is this a strange preconditioner? Because, first of all, it's not designed to make optimisation faster; most of the time it does the opposite. All it does is this: if a parameter is very small, its effective gradient essentially vanishes, so it moves extremely slowly near zero; if a parameter is very large, it is pushed to take very large steps. So it's exactly the same intuition as before. And it worked surprisingly well when run this way. I just wanted to give you a flavour of how you can get something qualitatively different out of your optimiser. When people think about optimisation research, they usually think: how can we make the optimiser converge faster, how can we make it cheaper, things like that. But here, this optimiser is actually slowing learning down; learning becomes more expensive and more complicated. And yet it gives you something completely different: it gives you minima with roughly the same accuracy as before, but with a tendency to be much sparser. And because they're sparser, you can prune them, you can compress the model, all kinds of things. So I guess the theme, and maybe I'll skip the next slide, is that anything you touch inside the pipeline that trains the model somehow affects the reachable set, and somehow affects the kind of solution you get. The deep learning magic we talked about at the beginning works if you respect the protocol, but the protocol is complex, every part of it plays a particular role, and if you break it, things get worse. And the other thing I wanted to say is that for a while there was this standard claim about deep learning: the idea that before deep learning we did feature engineering, where we would look at images and come up by hand with the patterns we wanted to detect, and so on, and the whole premise of deep learning is that we discover everything from data. We don't need to prescribe anything; we have this magic tool, we give it lots of data, and somehow it discovers from the data how things are supposed to be done. But that's not exactly true.
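Circling back to the gradient identity and preconditioner view used above, written out in my own notation (w is the effective weight, theta the trained parameter, eta the learning rate):

```latex
% Reparameterisation and the induced "preconditioned" update (sketch):
w = \operatorname{sign}(\theta)\,|\theta|^{\alpha}
\quad\Rightarrow\quad
\nabla_{\theta} L = \alpha\,|\theta|^{\alpha-1}\,\nabla_{w} L ,
\qquad
\theta \;\leftarrow\; \theta - \eta\,\alpha\,|\theta|^{\alpha-1}\,\nabla_{w} L .
% i.e. plain gradient descent on w, but with a diagonal preconditioner ~ |theta|^{alpha-1}:
% small weights take tiny steps (they stay near zero), large weights take amplified steps.
```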
Like, basically everything that we do provides inductive biases, and without these inductive biases nothing works. So I'm just trying to say: maybe you are not affected by this anymore, but there was definitely a period in time when describing inductive biases, or adding inductive biases to the architecture, seemed like the wrong thing to do. Particularly when we were in the phase of learning-to-learn and meta-learning and all of this, the whole concept was: you should never think about which inductive biases you need, you should never write down what properties you think the solution should have; all of that has to be discovered from data. And what I'm trying to argue here is that actually very few things are discovered from data. We are providing a lot of inductive biases through the choices that we make, whether that is the learning rate schedule, the type of optimiser, or the model that we use. I hope that was somewhat clear.

There was a question about how the exponent interacts with the optimiser. Yes, so, as you take the exponent towards one you are collapsing back to normal SGD, which is going to train fast, and as you increase it you add more of this implicit regularisation but you pay in convergence: training gets slower and you don't get as good solutions. And basically what we found, and the reason the preconditioner view matters, is that there is a clear trade-off between how stable and how easy it is to train the model and how sparse the solutions are. The other thing you can do is vary the power itself, and that is another way of controlling it. Your suggestion was essentially to interpolate towards the identity: if I raise this quantity to a power and make that power zero, I get one, which is the identity preconditioner, and that is plain SGD. So as I interpolate the power, I go from plain SGD to an increasingly extreme version of this; that is another way of interpolating. And what we found is that as you increase the exponent you get sparser solutions, but worse ones, because you're basically increasing the size of that plateau. And the speed of convergence suffers too, good point; there is an explicit trade-off there. So we had to tune this quite a bit, how close you are to the pure reparameterised version versus how close you are to plain SGD, and actually we had to stay quite close to SGD. If you look at the exponents, even a value of one is already too aggressive and performs quite badly; in the end we used exponents somewhere between 0.25 and 0.5.

There was also a question about what sparsity means here: sparsity here is really just the number of weights that end up at zero. Okay, so the method is called Powerpropagation, and in the paper you have Powerpropagation plus a few other things.
So maybe I'll just describe the blue curve, which is Powerpropagation plus pruning. What that is: you run the method, you train to convergence, and then any weight whose magnitude is smaller than epsilon you just set to zero, and then you count how many weights are left. Epsilon is just some very small value, 10^-4 or 10^-5 or something. We need this because the method will not give you exact zeros; it gives you weights that are very close to zero, but because everything is noisy they are not going to be exactly zero. That's why we add this final projection, to take these very small numbers and make them exactly zero.

The sparser the model, usually the worse it performs in accuracy; that's the usual thing people see. But what we were after is a good trade-off between how sparse the model is and how well it performs, because the sparser it is, the more it can be compressed: if you want to run it on a mobile or embedded device, you only have so much memory, and so on. To be fair, this particular kind of sparsity is only part of the story: if you really care about running things on small devices, you don't just want weights to be zero, you want weights to be zero in particular patterns, so that the hardware can actually exploit it. There are structured or group-sparsity methods where you say an entire column has to be zero, or something like that. Here we don't care about the pattern; we just let individual weights go to zero.

There was a question about how this compares to other sparsification baselines. It is much better than plain magnitude pruning. RigL, whatever the acronym stands for, was state of the art around 2020; it's a fancier method for inducing sparsity and I think it's still among the best performing ones. Our method is way better than magnitude pruning and very close to RigL. It's been a couple of years, so I don't know if RigL is still state of the art, but at that point magnitude pruning wasn't even a competitive baseline anymore; people compare against RigL or whatever came after it. The other difference, and the thing I like about our method compared to these, is that most other methods are quite cyclical: they rely on iterative pruning. The way that works is you start with a model, train for a while, then start sparsifying: you remove some weights, set them to zero, then you fine-tune, then you remove some more weights, fine-tune again, and so on. You have this whole schedule of multiple pruning rounds, and that does work quite well. What I like about our method is that it is continuous throughout training; there is no projection except at the very end. I don't think it makes a huge difference in practice; I just like the idea that you train in one go.

Okay, so we still have about 10 minutes until the break, and a few more slides in this section, so I wanted to move on. That was a lot about generalisation, and I tried to explain how regularisation comes into it.
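Just to make that final projection step concrete, a minimal sketch (my own illustration; the threshold and the toy weights are placeholders):

    import numpy as np

    def prune_and_sparsity(weights, eps=1e-4):
        # set weights with |w| < eps to exactly zero and report the resulting sparsity level
        pruned = np.where(np.abs(weights) < eps, 0.0, weights)
        sparsity = float(np.mean(pruned == 0.0))
        return pruned, sparsity

    w = np.array([3e-5, -0.2, 1e-6, 0.7, -5e-5])
    pruned, sparsity = prune_and_sparsity(w)
    print(pruned, sparsity)   # -> [ 0.  -0.2  0.   0.7  0. ] 0.6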
So, usually, generalisation in neural networks can be explained by saying that the models generalise because they regularise, they regularise properly. And maybe what is special about deep learning is that there is a lot of implicit regularisation that we weren't even aware of when we started building these models. As the field evolved, we figured out that using gradient descent is a regulariser, that the scale of the initialisation is a regulariser, and so on and so forth. And the overall point I'm trying to make is that almost everything in the pipeline, which is very complicated even if it looks simple, contributes: every component adds some form of implicit regularisation, and it is the sum of all of these regularisers that makes the model generalise so well. So that's the standard summary of this section: the takeaway message is that there is a lot of implicit regularisation happening in neural networks, and that's part of why they work better than other model classes, which might not have these particular implicit regularisers helping them along the way.

But this is all the in-domain generalisation we talked about. Then there is the question of out-of-domain. I'm not going to talk too much about it; I just want to explain a little bit what the difference is between in-domain and out-of-domain and give you some flavour of it. So here you see a picture of in-domain versus out-of-domain. And maybe let's go over the question, because I think we should know the answer. The dark points are the data, and the red and the green curves represent two different models that have been trained; they agree on this particular part of the signal. The question is: which of these two looks more like a ReLU network, the red or the green? Maybe the green? Okay, who wants to tell us why you'd say green? One answer: the green one looks like it becomes linear towards the edges, while the red one keeps bending. So, I mean, one answer is that we can't really know. The reason I was going to give is that, towards infinity, a ReLU network starts behaving linearly, and the red model does not look like it will, but maybe it just hasn't yet; maybe the red one also becomes linear at some point outside of what we're plotting. So really you can't know for sure. But this is, again, exactly the point I made earlier: I don't know if I pointed to the paper when I first said it, but there is a paper about ReLU networks extrapolating linearly, which makes exactly this argument, that any ReLU model behaves linearly after some point. And the reason, as we discussed, is that the model is piecewise linear with a finite number of linear regions, so far enough out the behaviour has to become linear. But, generally, answering this type of question is really hard. And the other thing is: we do care a lot about this out-of-domain generalisation, but we know that there's no free lunch, right?
You know that you always need to be specific, because if you say: I have some data, I want to train my model on this data, and then I want it to magically generalise to all kinds of things, that's not going to happen. For any architecture, you can come up with a corner case where the model breaks. And this is true for everything: humans can be easily deceived too. We don't generalise to all new instances either; we have particular ways of adapting. I put this visual illusion here, which is supposed to be either a rabbit or a duck depending on how you look at it, just to show how easy it is to confuse us as well. That is the point.

And, you know, even something simple: if I give you this sequence, do you know what comes next? I mean, you could guess what I was going for. I won't go as far as trying to trick you with the actual series, but the whole point is that you don't know: once I decide what the generating process is, there is really no way of saying, from the data alone, where things go. The only thing that allows you to do any kind of extrapolation is making certain assumptions about the underlying structure of the problem. As long as those assumptions are right, you can extrapolate. But you can never know from data alone; the data does not tell you what the correct assumptions to make are. Basically, there is an infinite number of functions that would fit the sequence, and without any prior about the generating process, all of those functions are equally likely; there is nothing that lets you prefer one over another. From the neural network's perspective, that is exactly how the world looks: it sees some data points, there is an infinite number of functions that fit those points, and it just needs to pick one. It has its own inductive biases, which are its assumptions, and it makes choices: for example, it keeps the function with the smallest norm, in some sense, and so forth. But that does not have to be what the underlying process is; that is an assumption we are imposing. And it works well when it comes to interpolation, but when it comes to extrapolation this small-norm assumption is not that useful; it has nothing to do with the kind of extrapolation we probably want.

So, okay, this slide is basically saying what I was just saying: when we talk about out-of-domain generalisation, it only makes sense with respect to some structure; some people like to phrase it in terms of symmetries or whatever other terminology they prefer. But the point is that it doesn't make sense to do research on out-of-domain generalisation without being specific about what exactly, what kind of structure, you're trying to exploit, and what kind of inductive biases you're willing to give the model. And there are concrete examples of this. For example, generalisation to longer lists or larger sets: that is a form of out-of-domain generalisation that has been studied in the field. Or, for example, you have transformers; transformers are trained on a certain context size, and you want the type of problems, the type of data that they see, to stay the same.
You just want the context to be longer. Another typical kind of out-of-domain generalisation setup, which comes up a lot in algorithmic reasoning, is where you say: I want to train my network to sort, I want it to learn to sort lists of a certain length, and then I want it to be able to sort longer lists. Algorithms have this property that you can always increase the problem size: whether it's sorting or tree traversal or whatnot, there is always a sense in which you can do exactly the same thing on a much bigger instance. And that is one axis along which you can ask: I want architectures that can do that. But that is going to be very different from saying: I want an architecture that generalises to changes in position, which is what convolutional networks give you, positional or translational invariance. That is a very different type of symmetry and a very different kind of structure to exploit. Or you say: I want to generalise to rotations, or to changes in background, and so forth.

And another issue with the field is that some of these symmetries or structures we want to exploit are easy to formalise and write maths for; others you cannot. And that makes things very, very painful. Usually the way it goes is that the ones for which you can do the maths are the ones people don't find that interesting, while the ones you cannot formalise, because they are something vague, are the ones people care about. Like: I want to be able to deal with changes in background. That is not properly formalised; what does it even mean? You struggle to say what is the background versus the foreground, and it depends on the perspective: someone's background might be someone else's foreground, depending on what we're interested in. So these kinds of things that don't admit any kind of formalisation are usually very, very hard to deal with.

The reason I bring all of this out-of-domain generalisation up is that we're living in a world where these systems are becoming genuinely useful, and the people using them are expecting exactly this kind of generalisation. When a person says they expect the model to do well, they don't mean they expect the test loss on in-domain data to be low, which is what a machine learning person would assume. They really expect these kinds of generalisations; they expect the model to extend to all kinds of things. And I think this is where things get messy and complicated.

A student asked: you said extension, which seems very different from generalisation, because you're not generalising to everything, you're just extending along a certain axis, for example, without everything else changing; so why do you still call it generalisation? Well, technically the community calls this out-of-domain or out-of-distribution generalisation; that is the term people use in the field. Whether it is the right word, I don't know. Another word that was quite popular is extrapolation. And I don't like extrapolation because, when I'm thinking of a single line, I know what extrapolation is.
When I'm thinking in a high-dimensional space, I don't know what's interpolation and what's extrapolation. If I have a torus and I'm inside it, is that interpolation or extrapolation? I don't know. These distinctions become harder and harder, maybe because I'm not imaginative enough to understand the divide between interpolation and extrapolation in high dimensions. But typically, in the early days of this sub-field, which I think really picked up around 2019, when you started seeing a lot more papers about it and workshops and all that, the initial terminology everyone used was out-of-distribution generalisation; that was the big term. Then people started talking about extrapolation, and then they switched to strong generalisation. Maybe extension would be nicer; that might actually be a better term if I had to pick one. But more and more, people are also just becoming a lot more precise, and they simply talk about invariance to longer lists, or things like that, and avoid the overloaded terminology altogether.

So I'm not going to go too deep into this topic, because it's messy. If we end up having time at the end of the course, which is not clear to me, we can come back to it; there is some really interesting stuff we could do here, but a lot of it is also quite messy mathematically, and I don't want to throw a lot of messy maths at you. But I do want to give you a sense of what is, at the moment, the most effective solution for this kind of problem. And usually the most effective solution is this: you decide what the structure is that matters, you manage to formalise it to a certain extent, and then you encode it as an inductive bias, either in the architecture, or through data manipulation, where you somehow make sure the data covers the particular signature that you want the model to learn, or through what you choose to optimise, although that one is less common.

As an example of this, and I think it's the simplest example, which I'll cover in a future lecture: graph neural networks. Graph neural networks are a space where this concept has been explored a lot, and you have this idea of algorithmic alignment and all of this stuff that has been studied there. Graph networks are a nice illustration because of what you end up doing: you change the structure of your model. You know the structure of the computation that you want to perform, so you change the forward pass to contain some of that computation, and you leave the learned modules limited to the component that stays the same no matter how you change the problem. And because of that, things generalise out-of-domain almost immediately, because the whole point is that, for the learned component, the input distribution stays the same no matter along which axis you grow the problem, like the size of the problem. If you increase the size of the problem, say the size of the graph for a graph network, a given node still sees the same thing: it sees its one-hop neighbourhood, and that stays the same.
So the function that you are learning, which operates only on this local information, is always in-domain, even though globally things are out-of-domain. And I think this is the trick: you are basically putting some extra machinery into the inference, machinery that handles exactly the structure you want to deal with, and then the model does not need to learn that structure from the data. The main point I want to make here, and maybe it's not satisfying for many of you, is that the way we really deal with this is by hard-coding the thing that we want to generalise over. If you really think about it, the way this works is not discovered from the data; it's encoded in the architecture. In some sense, we are describing by hand how to deal with the particular symmetry that we want to exploit. It is very rare that we actually learn, from data, how to encode it.

Let me make this concrete, because the graph neural network is the best example for this. One of the premises of graph neural networks is that you can change the size of the graph and, in some sense, the model works the same. And the way a graph network works is that you have two learned modules: a node function and an edge function. I'm jumping ahead a bit, but the point is that you learn two components. So say this is a simple graph that I have on the slide. The way you apply a graph network to it is: you have a function F for the nodes, which is applied at every node, and you have another function G that computes the message between two nodes, which is applied on every edge. These are the only two learned components. The way you do inference in a graph network, there is a whole procedure: you first update all the nodes, so you apply F at each node, then you compute every possible message independently, and then you have some aggregation mechanism, which is not learned, it's a sum or a max or whatnot, that summarises the messages and sends them forward. But the whole point is that every node sees the same kind of input distribution. This F basically sees its own state: the input to the node function is usually its own state plus the aggregated message, the sum of the incoming messages. And if I add extra nodes over here, for this node nothing changes: it still sees the same kind of sum of the same kind of messages, and everything stays the same. So for the components that are being learned, F and G, the input distribution stays the same even though the graph has changed its structure. In that sense, for F and G, things are simple: they don't need to do any kind of out-of-domain generalisation themselves; they are applied in-domain, so to say. The out-of-domain part is taken care of by the aggregation mechanism and by how the signals are routed. But that part is hard-coded; it's written by hand, there is no learned component there. So this aggregation mechanism is built, from the start, in a way that is robust to changes in the graph structure.
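A minimal sketch of that split between the learned and the hard-coded parts (the names f_node and g_edge, the linear maps, and the tiny graphs are all just placeholders of mine):

    import numpy as np

    rng = np.random.default_rng(0)
    W_node = rng.normal(size=(4, 2))   # learned node function F, here just a linear map + tanh
    W_edge = rng.normal(size=(2, 2))   # learned edge (message) function G

    def f_node(h, msg):
        # learned component: sees only its own state plus the aggregated message
        return np.tanh(np.concatenate([h, msg]) @ W_node)

    def g_edge(h_src, h_dst):
        # learned component: one message, computed from a pair of node states
        return np.tanh((h_src - h_dst) @ W_edge)

    def gnn_step(H, edges):
        # hard-coded component: message routing and sum aggregation over incoming edges
        msgs = np.zeros_like(H)
        for src, dst in edges:
            msgs[dst] += g_edge(H[src], H[dst])
        return np.stack([f_node(H[i], msgs[i]) for i in range(len(H))])

    # the same learned f_node / g_edge run unchanged on a 3-node and on a 5-node graph
    H3 = rng.normal(size=(3, 2)); print(gnn_step(H3, [(0, 1), (1, 2)]).shape)                  # (3, 2)
    H5 = rng.normal(size=(5, 2)); print(gnn_step(H5, [(0, 1), (1, 2), (2, 3), (3, 4)]).shape)  # (5, 2)

The only thing that changes with the graph size is the hard-coded aggregation loop; each learned function still sees a fixed-size, in-distribution input.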
So, in that sense, we hard-coded the part that deals with the graph structure, which is how messages are routed and aggregated, and we left to be learned only the local functions, which always see the same kind of input no matter the size of the graph. That is a typical way in which this game is played. Convolutional networks are another example. How do we become translation invariant? We apply the same filter at every possible position and then we aggregate: exactly the same kind of mechanism. So the translation invariance is hard-coded through the convolution over the data; it is not something learned by the filter itself. The filter itself only ever operates on patches of 4 by 4, or whatever your filter size is. If I make my image double in size, for the filter nothing changes; it is still looking at patches of 4 by 4, and the rest is handled by the convolutional layer, by the convolution operator itself. So that is the sense in which I feel the main working mechanism we have for this kind of out-of-domain generalisation is to hard-code some kind of recipe for dealing with the structure we want to generalise over.

There are alternatives, but they don't work as well. There was a big push when people were trying to learn rotation invariance, and a lot of people were using data augmentation for that; then you could say the invariance is learned. But you could definitely tell the difference: either you take your input and apply some transformation that makes it canonical in some way, so the invariance holds by construction, or you try to learn it from data. And through data you are always going to have failure modes: the model does not really learn to be rotation invariant; you can always find some adversarial example where it gets it wrong, because learning the invariance is hard. If you hard-code it, it holds by construction.

I also wanted to give another example. The other point I was trying to make here is about these inductive biases, this structure that you typically want to exploit: what does it look like? This example is a bit out there, and I don't know how convincing it really is. But the point I was trying to make is that the kinds of structure we want to exploit out-of-domain rarely match what neural networks already prefer. The main driving force behind neural networks is this idea of small norm and so forth, and this is one example where the small-norm solution would tell you not to use a particular input dimension, to give it a weight of zero, while the underlying generating process would tell you that you are actually supposed to multiply by it. What I'm trying to say is that this translation invariance, or rotation invariance, or these kinds of things, cannot be expressed through small-norm considerations; a lot of the time they have nothing to do with norm regularisation. This particular example is probably not the best.
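To make the convolution point from above concrete, a tiny sketch (plain NumPy; I'm using a hypothetical 3x3 filter rather than 4x4, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    kernel = rng.normal(size=(3, 3))   # the learned part: one small filter

    def conv2d_valid(image, k):
        # hard-coded part: slide the same filter over every position and collect the responses
        H, W = image.shape
        kh, kw = k.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
        return out

    small = rng.normal(size=(8, 8))
    large = rng.normal(size=(16, 16))
    print(conv2d_valid(small, kernel).shape)   # (6, 6)
    print(conv2d_valid(large, kernel).shape)   # (14, 14): same filter, bigger image

The filter never sees anything but 3x3 patches; the sliding loop, which is the hard-coded part, is what absorbs the change in image size.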
The message I wanted to convey here is that, a lot of the time, when we talk about out-of-domain generalisation, there is very specific structure that we want to exploit, and we cannot fold that structure under some very generic principle like "the simpler, the better". Most of machine learning lives under this mantra that the simplest solution is the best solution, Occam's razor and all of that. That is fine for in-domain generalisation. But when you go to this kind of out-of-domain generalisation, these are often very specific structures that do not fall under that generic statement.

There was a question about whether the example on the slide is meant to show exactly that. Yeah, so, this is my point; the example is bad, but the point really is that a lot of the very practical things we might want in terms of generalisation do not fall under this "simpler is better" principle, and that principle is not going to fix any of it. Why not a better example? It's not that I have a better one to show; it would have been really nice to show something realistic that you care about, because then you'd say: wow, why is that? What I'm showing is fine, but it's very synthetic; it makes sense in my head, I just don't think it's convincing to you, and I don't remember a better one.

So, how much time do we have? We'll do the break, but let me say a few more things first. I have a couple of slides on causality, which is a big topic; I'm not an expert on it, but it's one of those things I feel we should mention and people should be aware of, and probably you are already a bit more aware of it than I am. And then, in the last 10 to 15 minutes of the lecture, I might start on the next module, which is continual learning, but we'll probably just scratch the surface again, and then tomorrow we dive in properly.

So, again, I'm assuming a lot of you are aware of this, but this is the old story that correlation is not the same as causation. Here you have some simple plots showing pairs of quantities that have nothing to do with each other and are somehow very highly correlated. The reason this is important is that the driving force in neural networks really is correlation. Maybe we haven't talked that much about it, and maybe it's not obvious when you look at the equations, but if you go back to before this kind of machine learning, you have Hebbian learning, where the mantra that summarised the learning approach was "things that fire together wire together", something like that: two things that co-occur get connected. And if you look at what we do now, it is kind of the same thing: you see that an input and a label co-occur, and that reinforces the pathway through the architecture that makes that connection. So, at a high level, what these methods are doing, at least in standard supervised learning, where you have a big data set and you do many, many updates, is looking for correlations and exposing those correlations in interesting ways.
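Since I brought up the Hebbian mantra, here is the textbook form of that update, just as a reference point (a minimal sketch; nothing here is specific to the deep learning pipeline):

    import numpy as np

    def hebbian_update(W, x, y, lr=0.01):
        # "fire together, wire together": strengthen W[i, j] when pre-unit x[j] and post-unit y[i] co-activate
        return W + lr * np.outer(y, x)

    x = np.array([1.0, 0.0, 1.0])   # presynaptic activity
    y = np.array([1.0, 0.5])        # postsynaptic activity
    W = np.zeros((2, 3))
    print(hebbian_update(W, x, y))  # connections grow only where the two activities co-occur

The update is driven purely by co-occurrence, which is exactly the correlation-chasing behaviour the rest of this passage is about.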
And the point of this is to show that sometimes correlations can be misleading: there may be spurious features that get connected to the thing you care about without actually being related to it. We could spend more time on this, but I only have one more slide on it, which I won't go through in detail; I'll just say roughly what it was trying to say. Usually, the way this is framed, and maybe the one way people hope to get out of it, is to say that from purely observational data it is really hard to extract causal structure, because with observational data alone you cannot distinguish between correlation and causation. And the usual hope people express is that if you introduce reinforcement learning into the pipeline, then the agent effectively collects interventional data: it assumes some correlation is strong, tries to act on it to exploit it, and then discovers that actually there is nothing there, that things break down when you intervene. I've seen this argument; I also know that the causality folks are not necessarily convinced. A lot of them say that if you really want a causal system, you need causal learning, causal tools and causal operations, and I'm not going to go into how they do that. If you talk to more RL-flavoured people, some of them will say: we already have a way of doing this, which is RL plus interventions, and then all of this extra causal machinery is not needed. I just wanted to highlight the tension. Things have changed, and I guess people are more aware now, but this is the kind of thing that for a long time many of us did not think about. I remember when I gave my lectures on neural networks back in the day, this whole aspect of causality was never brought up.

And just to give you an example of how a lot of smart people can get pulled in by this: when COVID happened, one effort that started, at Google among other places, was to ask: what happens if you pool all the data Google has access to, all the queries being made and so forth, and try to predict the number of cases in a region? And this is a problematic idea, because you are just looking for correlations; there is no guarantee of predictive power. You have all these extra factors, like the weather, or what queries people make from a region, and the idea was: can we take all of this, compress it into some big feature vector, and learn a predictive model that tells us how the epidemic evolves? And this was done while largely ignoring the causal structure of the underlying problem, the things people actually knew about the virus and how it spreads. And it did not end up being predictive, because there was not much there to predict from: in this very high-dimensional vector there may be some causal variables hidden somewhere, but there were definitely a lot of things that were just correlations, and when there is a huge backdoor with a very strong correlation, it is really hard for the model not to latch onto it. And that is what it does:
It latches onto these weird correlations that have nothing to do with the underlying process. Actually, as an anecdote: during COVID there was a task force in the UK to deal with it, and my manager, who is a researcher as well, was part of it. And what he said is that the only thing ML contributed to COVID, in the UK at least, was PyTorch. It turned out that none of the machinery, none of the techniques we had been developing, was useful for anything; in the end, the only thing that really helped was hard-core statistics. They had a bunch of statisticians, and those were the only ones who built models that made sense. The only contribution that came from the ML community was that they ended up using PyTorch, because it is well optimised for the hardware. At least in my mind, it was one of those big failures. We keep talking about how AI is going to solve all our problems, but if we look back at COVID, I don't think AI helped that much; what helped was statistics and a lot of old-school knowledge we already had. So, again, that is just my view: yes, AI has the potential to help with a lot of problems, but one needs to be careful, because so far it hasn't solved that many fundamental problems. It may be on the verge of solving some of them, but so far I don't think we've really done that many things that changed our lives the way other technologies did.

Someone from the audience added: I was on that same team, and the pushback against AI was twofold. One: every government asked, if we lock down schools, what is the causal link? Is it going to spread more, or is this a good thing to do? And you've got all these statistical correlations in the machine learning model, and you cannot actually answer any of the causal questions that people really cared about, like should you stay home, or should you wear a mask or not. That's the first one. The second one is that the AI community said: we can make these amazing predictions of the future. And the government said: we want the error bars. Give us an uncertainty estimate, for example of the number of beds that are going to be available; we don't want just the exact number, we want the spread. And those models couldn't provide that, so all of this work was essentially binned, and the classical statisticians walked away with the gold medals.

Okay, so that was the detour on the causality-versus-correlation bit. It is a very important topic, and I think nowadays it is much more present than it used to be; but, as I said, when I was learning about machine learning, this question about causality, about the difference between finding the underlying causal structure and just finding correlations, wasn't that big of a topic. And it really is something we should keep in mind, because when we interact with these big systems, whether it is Gemini, Claude or whatnot, we are assuming, or at least I feel like a lot of people are assuming, that they can do causal reasoning.
And, from personal experience, the way we assess whether a model can do causal reasoning is: we have a benchmark. I don't think that is how you assess causality. A benchmark can be overfitted; a benchmark can be contaminated. I'm basically just asking: can this model really do that? I mean, they pass these checks, and they can probably do quite a few things, but I personally will not go as far as saying I trust this system as a strong engine for causal reasoning. I would rather do my causal reasoning myself; I'm probably not perfect, but I would rather trust myself than trust one of these models. But it is a real issue: people do use them that way a lot of the time when they interact with these systems; they just assume that the system can reason in that particular sense, and that assumption definitely does not always hold. Okay, so that was one thing I wanted to bring up.

The other thing going on here, and this is not about the distinction between causality and correlation but about latching onto spurious correlations, is the following. There is a whole family of experiments where you construct two features that are both highly correlated with the label, and a very small number of examples show that one of the features is actually wrong compared to the other. And you make the relationship between the misleading feature, the one for which you have a few counter-examples, and the label very simple, like a linear relationship, while the relationship between the feature you really care about, the one that is always predictive, and the label is much more complicated. I don't have the data for this on the slide, but you can run experiments like this and show that, when you train, the model goes for the linear feature, because it is much easier to pick up. This is a natural thing for neural networks to do, and in particular, in the position paper I'm drawing on here, the claim is that this has something to do with statistical learning and the fact that we are minimising the average-case error, rather than doing what they call exact learning, which in some sense means minimising the worst-case error. So there is this other perspective; it is an interesting position paper, and I'm bringing it up for those of you who are generally interested in these topics and want to know what some of the newest thinking looks like. It is a position paper you can find; it criticises statistical learning, it was rejected from one venue and, as far as I know, is being resubmitted now, so it is very recent stuff, and it is of that flavour. It is just one perspective, but I think it is interesting to explore, and it is useful to see the wide variety of perspectives people have on these problems.

One thing I found quite interesting in that position is that, if you read between the lines, the objection is not really just about average case versus worst case, that you really need to learn the worst case. He is trying to argue that examples alone might not be enough to learn certain things.
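A minimal sketch of the kind of two-feature setup described above (entirely my own construction, with made-up numbers, just to illustrate the "simple misleading feature vs. complicated reliable feature" idea):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    x_true = rng.uniform(-3, 3, size=n)     # reliable feature, but with a complicated relation to the label
    y = np.sign(np.sin(2 * x_true))         # the true decision rule is nonlinear in x_true

    # misleading feature: linearly aligned with the label on ~95% of examples, flipped on the rest
    flip = rng.random(n) < 0.05
    x_spur = y * (1 - 2 * flip) + 0.1 * rng.normal(size=n)

    X = np.stack([x_true, x_spur], axis=1)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w)   # almost all the weight lands on x_spur: the simple, slightly-wrong feature wins

A linear fit is obviously a caricature of a neural network, but it shows the pressure at work: the feature with the trivial relationship to the label soaks up the predictive weight even though it is wrong on a small fraction of examples.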
So, for example, and I have been talking a lot about adding numbers: the claim is that if you only ever see examples of added numbers, there is no way of discovering the addition algorithm from them, because the examples are not sufficient to derive it. His point is that, when we learn addition, we are told what the algorithm is. The perspective here is about the things we want to generalise: addition has this property that it almost forces you to think about out-of-domain generalisation, because you see it on one range and you want to be able to go further and further away and keep applying the algorithm correctly over the entire domain. And the point he is making is that, for algorithms, there is some structure, some assumption, that the algorithm relies on; that is how it works, and that is the structure, the symmetry, that you want to exploit when you are trying to learn the algorithm. But that structure is not universal; it is specific to the task you are learning. So his argument is that it is potentially not feasible, or not the right way to think about it, to try to encode all possible meaningful symmetries or inductive biases into the architecture, some specialised architecture that has all of these inductive biases built in and then learns everything else from data. Instead, he is saying that when you ask a system to solve a problem, you should specify to the system what the structure is, or what assumptions it is allowed to make. And he argues that this is how we actually operate as humans: when we learn addition, someone tells us what the algorithm is; when we learn to do some other task, it is usually presented to us, we are told what the structure is, and then we use some examples to move forward from there and acquire the algorithm. So this is the hypothesis he is pushing: it is not just about moving from the average case to the worst case in terms of learning from examples; you also need something else. If you want the model to do addition, you first need to provide the LLM with the addition algorithm, written in a way that makes sense and can be exploited as an inductive bias, and then provide a bunch of examples, and together with the description the model can infer and imitate the algorithm from there onwards.

Obviously, there are two difficulties here. One: learning under the worst case is not easy; potentially you would have to start with the average case and switch to the worst case afterwards, otherwise nothing works and you don't get any signal. It is not a trivial thing; if it were trivial to learn from the worst case, people would have done it a long time ago. And the other difficulty is this whole idea of task descriptions, of providing the algorithm: it is not clear how to realise it. My hunch is that putting something in the context is not the same thing as giving an inductive bias in the architecture; just because it is in the context, it does not mean the model can use it the way we think it should.
So potentially, if you want to push this direction, you need another way of framing these task descriptions, of providing them to the architecture, such that they have the right kind of impact on the inference, so that they can act like a hard-wired inductive bias. Both of these are open questions, and no one really knows what to do about them. But anyway, I wanted to present this perspective as well.

There was a question about how this differs from simply giving the task description in the context. Yes, so this is in the spirit of few-shot learning, and actually the first thing he looked at was exactly that comparison: what happens if you just give examples and ask the model to predict, versus giving a description of the task plus examples and then asking it to predict. The hope was that with the description it would do much better, and it turned out that it did not; it did about the same. That is why he concluded that, somehow, the model does not know how to use this extra piece of information, the description of the algorithm. It is not always the case, and things may be better nowadays, but I can give you a concrete example from my own experience; it is something I have on a slide further on, but I'll mention it now. We were looking at using a model to sample from a random number generator. If you ask whatever model you like to sample from a distribution, it is not going to be very good; it is not going to produce anything that looks like real samples. So we provided the algorithm for a random number generator in the context, and with the algorithm it got better, but it was still far from the right thing. And if you ask it to sample twice in a row, it falls apart; you can get one sample, but not two, and the reason is that a lot of random number generators rely on a seed, and that seed very quickly explodes into very large numbers, and then you need to do exact operations with very large numbers. You have to keep track of numbers with many, many digits, and LLMs cannot reliably do arithmetic on more than a few digits; you have to multiply, take powers, do modular arithmetic, depending on which generator you use, all on very large numbers. There is a tiny sketch of why this blows up right after this paragraph.

Okay, so let me check the time. We have about 15 minutes here, and I'm just going to go slowly, because I think everyone is tired. This is why we switch to adaptability; maybe I'll just introduce the topic and then we leave the rest for tomorrow. If I try to structure what my lectures have covered so far: we started with some introductory material, then quickly jumped into expressivity questions, where we talked about how you can measure the expressivity of a model and why models look the way they do, and then we switched to optimisation first, and after that we talked about generalisation and regularisation, and things around that.
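Picking up the random-number-generator example from the previous paragraph, here is a minimal sketch of the arithmetic a model would have to carry out exactly; the constants are the classic MINSTD Lehmer-generator ones, chosen by me purely for illustration:

    # a Lehmer / linear-congruential style generator: state <- (a * state) mod m
    a, m = 48271, 2**31 - 1
    state = 12345
    for _ in range(3):
        product = a * state      # the intermediate product quickly reaches 14+ digits
        state = product % m      # and an exact modular reduction of that product is needed
        print(product, state)

Every step requires an exact long multiplication and an exact modular reduction; one digit wrong anywhere and the whole stream of "samples" is wrong from that point on, which is exactly where a model following the recipe in-context tends to fail.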
So this, overall, is basically the backbone of machine learning. What we haven't touched yet, and what comes after this part, is a section where we talk about convnets, graph networks, recurrent networks, transformers and so on: architectural designs and the specific things about each of them. The things we've talked about so far are relatively generic; for most of them, the mental picture I had in mind was the MLP. Most of what we discussed was on the MLP, just as the simplest possible thing. Here we are going to look at a different side of learning, which is this question of adaptability, or continual learning. It basically asks: what happens if we ask the system to continually adapt as the world changes? And we'll discuss a little bit why this does not work out of the box, what the main issues are, what some of the existing algorithms look like, and so on. And then, after that, we basically wrap up the part of the lectures focused on learning and go back to architectures, and talk about transformers and things like that.

The other note I wanted to make is about the MLP; maybe I should have started with this, to get you more excited about spending so much time on MLPs. The MLP is the most important component of the transformer. The attention doesn't actually do as much as you might think, and it turns out you can replace attention with other things; but the MLP is where the heavy lifting happens. So understanding the MLP is extremely useful, because in the end it feels like it does most of the job in everything that came afterwards.

Let me see what time it is and go to the next slide. Okay, so maybe I'll go through the motivation; how much time do I have? The motivation, then. Why is adaptability, why is continual learning, important? This is supposed to be your cute picture of the day, so I chose a cat. AI is being used everywhere, and because it is being used so widely, very often these agents will need to interact with scenarios that they haven't necessarily seen in training; it is very likely, depending on how the agent is deployed, that it gets forced into settings like that. And there are a couple of ways you can think about this. One is the framing we discussed a bit before: treat it as an extrapolation problem, or an out-of-distribution problem, and think about how to solve that. But another framing, which I think people find useful, is to say: these models don't need to extrapolate, they just need to adapt very fast. If they adapt essentially instantaneously, continuously, while they interact, then from a user's perspective it will feel the same, and it will be good enough. And there is another reason to care about this: it is essentially how biological systems deal with the problem. One trait that is very specific to biological systems is that they are very adaptable; that is how we survive. So the motivation here is that, right now, our systems are mostly focused on scale and on static benchmarks, and progress so far has really been driven by this idea of: look at the performance on the benchmark.
Usually there is a number, and then you do whatever it takes to push that number up, or down, depending on the metric. And this has been good, because it worked; it is the reason we are where we are. It is easy to justify, easy to predict, it is easy to build a business around if you want to, it is easy to do research around, and it is easy to know whether you are there yet: you know immediately whether you are getting good numbers or not. However, this is not what we actually want or expect. If we go back to where we started, to the discussion about what machine learning is, machine learning is about learning to solve tasks, not about solving benchmarks. There is a huge difference between saying "I learned how to add numbers" and saying "I got 99% on this particular fixed set of additions that someone gave me". And I think, as a community, we want to move away from the benchmark; the benchmarks right now are becoming more and more problematic. What I have on the following slides is basically a list of reasons why we should move away from the benchmark-only view, and let's see if we can go through a couple of them at a high level.

One reason why benchmarking is complicated, or not enough, is that the world changes. This is the standard framing you see in most continual learning research: the world keeps changing, continuously, sometimes in unpredictable ways; there are always new things coming. So any artifact that is frozen is not going to be able to keep up with that change; a static system cannot follow what is happening. Now, there is a counter-argument to this perspective; it's a big recurring debate, and I'll see if I can do it justice. The question is whether the change happening around us should live in the learning part of the system or in the memory part of the system. You could argue that this is just about keeping track of some variables and providing them to the learned system, while the bit that you actually learn is pretty stable: there is a behaviour, and then there is behaviour conditioned on the memory. The idea is that maybe there is something static that we can learn, namely how the system interacts with some data set or some memory buffer, and then, as the world changes, the only thing you have to do is update the memory. In the case of a transformer, another way of phrasing this is that maybe all of these changes in the world can be taken care of by updating what is in the context. The weights are fixed; yes, the world changes, but you can also change the context, and the context acts as the memory. It keeps up with what is going on in the world; you just keep track of things, adding to the context what is relevant and throwing away from the context what no longer matters. So this could be a framing of how you deal with change. And that doesn't mean it is not a machine learning problem, but it is a different machine learning problem: we don't need to talk about changing the weights of the neural network, because we don't need to change them; those we can train once and leave as they are.
And then we only talk about context management, and how to swap things in and out of the context. Another argument for why this static regime is problematic is that it is really expensive to pull off. If you want a system that behaves well everywhere in the world, you need to capture what is happening pretty much everywhere, which means you need lots and lots of data. And then you need to make this data i.i.d., which is also complicated; it is genuinely not trivial. If what you have are streams of data collected for a purpose, it is not trivial to sample from them in a way that actually gives you something i.i.d. So you have all of these questions: how do you collect the data, how do you ensure coverage of everything you care about, how do you know that you don't have blind spots somewhere in your data collection pipeline, and so on. And all of these questions become harder and harder with scale, because how do you even know whether you have a gap in what you've collected? You cannot inspect the data any more, there is too much of it; you cannot summarise it. It is all of these kinds of blunt forces that make it really hard. So this is another argument for why just focusing on the stationary setting and on scale is not the right route: because it is really hard. The counter-argument to this particular point is: sure, it is very expensive and hard, but you do it once, and then you train your model and you are done; maybe that is enough.

And maybe this is the last one, let me check. Yeah, I'll just do the last one. The last point I wanted to make is that the other thing that happens with scale and this static view is that performance becomes really hard to measure. The way we define performance, at least from a statistical standpoint, the way we think about performance and evaluation, makes sense only in standardised scenarios: you know that the validation set is independently sampled, identically distributed, and so forth. But when we are working at this kind of scale, there is no way of controlling for train-test contamination. Sometimes it is not even clear how you define an example: if you have the same sentence but one word is different, that probably changes the semantics of the whole thing, so if you are doing next-token prediction, at what point is this contamination and at what point is it something genuinely different? It just becomes a really, really hard problem. You also start having concerns like: you want the system not to have certain biases; how do you measure whether it has them or not? None of these things are easy to evaluate anymore. If it helps, that is why NeurIPS now has a track just for evaluation, for benchmarks and datasets, if I'm not mistaken. And I just want to highlight this: we have gotten to a point where we do not really know how to evaluate these systems, we do not know how to compare them, or decide which one is better or worse. There are all kinds of things people do, like red-teaming and human evaluation and all of this, but all of it is, yeah, it is hard.
Basically, it is really hard. Not that the promise is that switching to a system where it’s all about adaptability will make things drastically simpler; the point is that we are changing perspective, where the goal is not necessarily how well you’re performing, but rather how quickly you are adapting to any change. And that might be easier to measure, or easier to formalize. If the main focus is not your overall performance but how quickly you react to anything new in the environment, then different kinds of methods may come out of that framing. I have other stuff, but I’ll stop there because I promised to stop; the other points we can do another time.
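One very rough way to operationalize “how quickly does it adapt” rather than “how well does it perform on average” is to track the online loss the learner pays on each new example and measure, after each distribution shift, how many steps it takes for the running loss to drop back under some threshold. This is only a sketch of the idea; the known shift points, the threshold, and the window size are all assumptions, and in practice you rarely know where the shifts are.

```python
def recovery_times(losses, shift_steps, threshold=0.5, window=20):
    """For each known shift step, return how many steps until the moving-average
    online loss falls back below `threshold` (None if it never recovers)."""
    times = []
    for s in shift_steps:
        recovered = None
        for t in range(s, len(losses)):
            start = max(0, t - window + 1)
            avg = sum(losses[start:t + 1]) / (t + 1 - start)
            if avg < threshold:
                recovered = t - s
                break
        times.append(recovered)
    return times

# Usage: `losses` is the loss paid on each incoming example before training on it
# (prequential evaluation); smaller recovery times mean faster adaptation.
```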
LECTURE 7:
There are some things that maybe I make sound bigger than they are. Like, evaluation research exists, there are people working on it, but it’s a bit of a niche topic. By far, everyone just stays within the standard i.i.d. learning setup. One caveat to that: as the field has grown, we have kind of stopped following the protocol correctly. There are a few things happening here. For example, if you train models, you’ll probably recognize a lot of this. First of all, no one really samples i.i.d. from their data set. You shuffle your data once, and then you go through it multiple times, each epoch in the same order. That is not i.i.d.; it introduces a lot of correlations in how your samples arrive. That’s technically not correct according to the theory. That said, it works, so no one bothers. It gets even worse when you have extremely large data sets, the way they have them for LLMs: you have trillions of tokens, spread across huge storage systems, and sampling i.i.d. becomes really hard and really painful, so people don’t really do it. They shuffle things around as much as they can, so that the data looks as if it were random, but it really isn’t; there is a lot of structure in how the data is presented. To what extent this affects the learning system, I don’t think is that well understood. There have been some experiments, some people looking at what happens at small scale if you shuffle the data or don’t, things like that, but I don’t think it went much further. However, if you really do not shuffle the data at all, if you make no attempt to follow the protocol, say on MNIST you show the model all the images of zeros, then all the ones, then all the twos, and so forth, the model will just not learn anything. That’s an extreme case, but you can make the structure in the stream extreme and the model simply fails, and this is really what continual learning is trying to face. So the field of continual learning says: assume your data is coming as a stream you have no control over, you can’t shuffle it, you can’t change it, and that stream has a lot of structure; how do I change my learning process so that I can still learn something out of that stream (a minimal sketch of this ordered-stream setup follows below)? I’m not going to go through all the motivations again, there are multiple reasons why you might care about this, but one of them is: if I have a robot in the real world, the real world acts like that. I’m just looking at the world around me, the stream of data is coming at me, and I don’t have full control over it; I have some control, I can choose where to look and things like that, but I can’t really control the distribution as much as I’d want. And I still need to learn from it; there is no other choice. That is the kind of setup continual learning would like to deal with. Alright, to go back to what I wanted to say: continual learning and meta-learning are different topics, but they end up being quite related. I’m not sure how far we’ll get with these slides, but when we get to the part about in-context learning in large language models, that is a form of meta-learning.
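Going back for a second to the ordered-stream example above: a minimal sketch, assuming MNIST via torchvision, of the difference between the usual “shuffle once per epoch” protocol and the class-ordered stream that breaks plain SGD. Training the same small network on the ordered stream is the classic recipe for catastrophic forgetting.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

data = datasets.MNIST("./data", train=True, download=True,
                      transform=transforms.ToTensor())

# "i.i.d.-ish" protocol: shuffle once per epoch.
iid_loader = DataLoader(data, batch_size=64, shuffle=True)

# Ordered stream: sort indices by label, so the learner sees all the zeros,
# then all the ones, and so on. No shuffling allowed: this is the continual setting.
order = torch.argsort(data.targets)
stream_loader = DataLoader(Subset(data, order.tolist()),
                           batch_size=64, shuffle=False)

# A model trained with plain SGD on `stream_loader` typically ends up accurate
# only on the last classes it saw, which is the failure mode continual-learning
# methods try to fix.
```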
So meta-learning is all about this: at test time, as part of the input to the model, you get a bunch of examples, and you want the inference process of the model to mimic the learning process itself, learning on the fly how to solve that particular task. The idea behind meta-learning is that we don’t really know how to design learning algorithms for these scenarios: we have SGD, for example, but we know it doesn’t work well in settings like these. So instead of trying to figure out by hand what the learning algorithm should be, the meta-learning move is to go one level above and say: how about we learn the learning algorithm? Since it seems we don’t know how to specify it, let’s try to learn it from data. Which sounds kind of silly, and it is silly up to a point: you can’t always keep going up a level and say, I don’t know how to do this, so I’ll learn how to learn it, and I don’t know how to do that either, so I’ll learn how to learn how to learn it; at some point things have to bottom out, right? But in this case meta-learning kind of works, because, going back to the theme from before, things are not going to come out of thin air: there are certain inductive biases that you need and that you can’t extract from the data. But when you go to this meta level, where you’re trying to describe an algorithm that will allow the system to learn how to learn, you’re in a different space. You still need to introduce inductive biases, but because you’re in a different space they might be more natural there; it might be easier to introduce them. And in practice that’s kind of what happens: certain things are easier to define at the meta level. So meta-learning, the way it works, is that the input to the model is actually a data set in some form, and the output is the behaviour you’d expect the model to have after learning from that data set. Continual learning and meta-learning end up being a lot more connected once you go down into the real details and look at the algorithms; there are a lot of relationships between the two. They’re different in the sense that meta-learning still mostly works in an i.i.d. setting, at least in principle; meta-learning doesn’t say anything about the distribution having to be non-stationary. So that’s kind of the big difference between them. And part of the question was also about JEPA, which I might come back to later, but it’s not necessarily related to this. JEPA is the thing Yann LeCun is pushing, and he frames it as something much bigger than I think it is; I’m kind of suspicious of that part. I actually don’t even understand why he makes such a big deal out of it; to me it’s not that interesting, it’s very similar to other techniques that are out there, but he has his own obsessions. JEPA is a self-supervised learning algorithm. So what does that mean? Let’s look at where that whole line of work started. It started with the idea that we need deeper architectures in order to do better, and one particular point that was pushed very hard from the beginning, especially by LeCun, is that what these deep architectures give you is the ability to learn representations.
So this goes back to what we were discussing before: one of the hypotheses, or one of the theses, is that you need these hierarchical representations that build on each other in order to get abstractions, and those abstractions are what allow you to do cognitive tasks. At a high-level, philosophical kind of perspective, this is a theme that’s very popular in cognitive science and in neuroscience: the idea that it’s really hard to do any kind of reasoning process if you don’t have access to temporal and spatial abstractions. The way we reason, we first look at the data, we abstract it into concepts, and then we operate on those concepts. And there is theory out there saying that if you don’t have these abstractions, reasoning simply cannot happen; there’s no other way of doing it. As I said before, there are also people who say you can. The octopus, in particular, is one example people point to when they try to argue against abstractions and centralized reasoning. Those two things usually go hand in hand: you have abstraction and you also have centralization; you have the brain doing the reasoning. And then on the other hand you have something like the octopus, which doesn’t have a centralized brain. That’s maybe the most interesting thing about that animal: somehow the nervous system is distributed across the tentacles, and yet it is one of the more intelligent animals out there; it can open jars and do all kinds of things. So there is some theory saying that you could potentially do the same without a lot of abstraction, but 90% of the community would say that abstraction is needed, for sure. So deep learning, really the whole point was about learning abstractions in order to do reasoning. Another framing of this: you have these intermediate layers, and what they do is build representations, build abstractions, on top of which you can put something like a linear classifier or a logistic regression. That’s how the system works, right? You have this deep stack that gives you the right representation, and then a simple classifier on top. And usually the way people think about it is that if I have the right representation, then everything else is easy. The hardness of solving any of these tasks is in finding the right projection that gives you the right representation; afterwards, if you have that, it’s easy; we know how to solve the rest once we have the right abstractions. That’s why ICLR, the conference that was created during the deep learning era, the flagship deep learning conference so to say, is called the International Conference on Learning Representations: back then everyone thought that the main thing about deep learning is the fact that it learns representations, and that the representation is the only thing we care about. I think this focus on representation learning has kind of died off a bit.
And nowadays, you know, people working on LLMs would not talk about representations; they might not even be aware that they should care about representations very much. But it was definitely top of mind for people like LeCun, and Yoshua Bengio, and others back in the day. And JEPA is basically a method to learn representations; that’s why LeCun is so invested in it. The self-supervised setting is just this: you have lots of data, but you don’t have labels or any other information about it. It’s like the density-modelling, unsupervised setting: unlabelled data, and somehow you want to learn a representation out of it. You want some process such that just looking at all of that data produces a useful representation, and then you can take this learned representation and apply it somewhere else. I pretrain my model using, say, JEPA on lots of data, wherever it’s coming from; I get a trained model out of it; and then I use this model to do, I don’t know, some classification task. And I expect that to work because the model has already learned the right representation in the previous stage; when I get to the downstream task, I don’t need to learn the representation anymore, I only need to learn the linear classifier on top. Sometimes I even freeze the representation: I don’t want to touch it, I just trust that it’s a good representation. In computer vision this has been a big question: how do you learn representations from unlabelled data? And JEPA is one proposal in this space. There are a lot of other proposals; honestly I don’t know why he is so fixated on JEPA; there’s a bunch of others, they all kind of look the same, you get these papers where one is better than the other by epsilon and so forth. I don’t put that much weight on it. What JEPA actually does, and I was just looking this up, so I might say something imprecise, but I think this is roughly right: it’s in the family of contrastive-style representation learning. Is it strictly contrastive? I think it may actually be more of a masked, predict-in-latent-space kind of thing, but roughly: you take an image, and then you take a transformed version of the image, say you crop it or you shift it or you apply some transformation. Then you push both through an encoder, which is your representation learning system, and you get some latent representations. And then you ask that these latent representations be the same, because this image and this transformed image are basically the same object; you can assume they are the same object because you did the transformation yourself. So that’s one part, and then there are many variants of JEPA, and I don’t remember which one does exactly what. But then you can also have a contrastive loss where you take another image that is different, and you say: for the image and its transform, pull the representations together; for the different image, push them apart. And that’s it; you keep doing this for many, many pairs, and over time you learn a representation, and that representation ends up being relatively robust. There are many techniques out there of this sort. I don’t really have slides on self-supervision; I didn’t think it was necessary to cover it, also because it’s more of a computer vision kind of thing and I’m not really a vision person.
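A minimal sketch of the contrastive flavour just described: two views of the same image are pulled together in latent space, a different image is pushed at least a margin away. To be clear, this is generic contrastive representation learning under assumed encoder and data handles, not the exact JEPA objective, which predicts in latent space rather than relying on negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(encoder, x, x_aug, x_neg, margin=1.0):
    """Pull together the latents of an image and its transformed version,
    push the latent of an unrelated image at least `margin` away."""
    z     = encoder(x)       # latent of the original image
    z_aug = encoder(x_aug)   # latent of its crop/shift/augmentation
    z_neg = encoder(x_neg)   # latent of a different, unrelated image
    pos = F.mse_loss(z, z_aug)                   # positives: should be the same
    neg = F.relu(margin - F.mse_loss(z, z_neg))  # negatives: should be far apart
    return pos + neg

# Downstream use, the "freeze the representation" recipe from the lecture:
# after pretraining, freeze `encoder` and train only a linear classifier
# (or a linear policy head, in RL) on top of encoder(x).
```

The margin-based pair loss above is only one of several flavours; InfoNCE-style losses over large batches of negatives are another common choice.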
But that is roughly the mechanism they use. There are other objectives, but contrastive learning, where you take things that are supposed to be similar and pull their representations together, and things that are supposed to be different and push them apart, is by far one of the most popular. It has different flavours, but that is the generic form. It has been used in RL as well, in a lot of places, because RL has the same problem. Sorry, I’m making lots of detours, but in RL you have the same issue: you have a deep network, you have your RL objective on top, and that network has to do two things: it has to learn a representation that takes care of perception, and then from that representation it has to learn a policy. And the issue is that the updates you get from the RL objective are extremely noisy; the data is messy, the rewards are sparse, none of it is nicely behaved. So RL is usually seen as a very noisy optimization process, and because it’s so noisy, it’s really hard to learn representations from it, compared to supervised learning, where the signal is strong and you learn very quickly how to represent the image properly. If I take a ResNet from scratch and try to train it with reinforcement learning, on Atari or whatever, it’s going to take forever. So in RL, folks have been playing with the idea: what if I add an additional objective that is just there to learn the representation, and the RL objective is just there to learn the policy? You get these systems with auxiliary losses attached to the model, doing something like contrastive learning, just to help the system learn a representation faster. So that’s one choice you can make that might make learning faster, and people have been playing with these kinds of combinations to various extents; some are more popular than others, but you can imagine how these things come together. The other thing people have been talking a lot about, and I guess that’s largely Yann’s influence as well, is this idea of world models: if you have a model that can predict what you’re going to see next from what you’re seeing now, that’s a world model. I still don’t fully see why they think it’s so central, but the whole point of a world model is that it’s something you can use to plan. Instead of interacting with the world, which is costly and hard, you have your own model of the world and you can interact with that; you can do search in your world model, all of these kinds of things. The view is that some tasks you can’t really learn directly, because they’re hard, and the way to solve them is to do some kind of search, because you’re never going to get them right in a single forward pass. AlphaGo works like that, right? It’s not just a policy that tells you the next move; after that you do MCTS, which is a kind of search: you try many continuations, you expand them, and you need something that tells you, for each of these candidate trajectories, what happens next. The world model is what simulates the world and tells you roughly what would happen if you take this action.
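To make the planning idea concrete, here is a minimal random-shooting planner: much cruder than MCTS, but the same principle of simulating candidate action sequences inside a learned model instead of the real world. The `world_model` and `reward_model` callables are assumptions, standing in for whatever learned dynamics and reward predictors you have.

```python
import torch

def plan_with_world_model(world_model, reward_model, state, action_dim,
                          horizon=10, n_candidates=256):
    """Random-shooting planner: sample candidate action sequences, roll them out
    inside the learned world model, score them with the learned reward model,
    and return the first action of the best sequence."""
    best_return, best_first_action = -float("inf"), None
    for _ in range(n_candidates):
        actions = torch.randn(horizon, action_dim)   # one random candidate plan
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)                    # imagined next state
            total += reward_model(s, a).item()       # imagined reward
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action
```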
And this is another place where JEPA is supposed to come in. But to go back to the original question: I don’t see why you’d think it’s such a big thing. I definitely agree with him that we need to move away from scaling at some point. Even if scaling were the solution, it’s just not healthy, for all kinds of reasons. So if we know, or think, there might be another solution that doesn’t involve scale, a big part of the community should jump on that and work on it. And I think everyone in the community basically agrees: nobody likes scaling except the few groups that have the ability to do it, and they use it to prop up their monopoly, or whatever is going on, which is fine. But research in general would prefer an option that doesn’t require scaling, and everyone agrees with that. I just find it funny when people claim to find solutions that avoid scale and then end up scaling anyway. And that’s how I feel about JEPA: I really don’t see how JEPA without scale gets you there; it’s not like JEPA suddenly works at small scale. So it doesn’t really feel like it changes the nature of the problem. Then there was a question about representation learning: as far as I understand, with JEPA we don’t compare the actual reconstructions, we compare the representations; but if I use a masked autoencoder, I’m comparing the actual reconstructions, and shouldn’t it end up doing the same thing, since the latent space of the autoencoder ends up holding the representation anyway? Yeah, so I guess the question is: you have systems like JEPA, where you compare things in latent space, and you have reconstruction-based things like autoencoders, where you usually take an image, encode it, decode it, and then compare the reconstruction to the original in pixel space. I think people nowadays tend to believe that the reconstruction-based approach doesn’t work as well. Reconstruction is kind of the first technique people came up with; autoencoders go back to the very early days of the field. The reason people are slowly moving away from it is basically that the distance in pixel space is really problematic. You use mean squared error between pixels, but in computer vision there are lots of papers showing that mean squared error is not a reasonable distance: things that should be very far apart end up very close under MSE, and things that should be very close end up very far apart under that metric. Because of this, at least on images, where a lot of this research is done, reconstruction doesn’t work that well. When you start training, the loss goes down and things are okay, but once the loss drops below some value, it’s not really clear whether your model is still improving or not. Unless you can actually drive the reconstruction error to zero, your improvements stop tracking the thing you actually care about.
So, moving to these kinds of distances in the latent space: the usual assumption is that the latent space is more semantically meaningful, that any movement in the latent space means something. So when you use mean squared error, or some other distance, in the latent space rather than in pixel space, it just tends to work better. Then there was a question: if I move to NLP, is BERT more of a representation-learning approach, because its objective is masked rather than autoregressive? I might be missing something, and I don’t remember all the details, but I think there is no deep difference between BERT and, say, ChatGPT or Gemma or any of these LLMs. They’re both doing a form of reconstruction, the same denoising-style idea, masked in one case and autoregressive in the other. Even in BERT you’re still classifying the token; you’re not doing anything fundamentally different: they’re both reconstructing in token space, using the same machinery. The point is that the token space is much better behaved than pixel space. That’s partly why language moved so much faster, and vision is still a bit behind with these things. The real difference between BERT and the GPT-style models is that one is autoregressive and the other does the masked thing. And to be honest, I don’t know if we have sat down and run enough experiments to understand the difference between autoregressive and the masked version. There is this assumption that autoregressive works better; I don’t know how well understood that actually is. What happened is that OpenAI came up with GPT and then immediately started scaling it, and once you have models trained on much more data, you’re not comparing at the same level anymore. Would BERT work if you scaled it similarly, trained it on similar data and all of that? I don’t actually know; I don’t even think there is a paper that does that comparison. There are other reasons to like the autoregressive version: it’s closer to how language naturally feels, right? You generate one word at a time, you go forward through the sentence, and that’s how you write things. So it has a very natural feel to it. And there is now a trend of trying to use diffusion models for language, which is not quite there yet, but it pushes back on this framing, because diffusion generates the whole text at once and then keeps refining a few words at a time. So it behaves quite differently: instead of producing one word at a time left to right, you start with essentially random words in random positions, and then you keep refining, left and right, until you get to the right text. As far as I know, text diffusion is at a point where a lot of people argue it works about as well as the autoregressive approach. And then there is the question of whether it’s actually any better; there are reasons why it might be better and reasons why it might be worse.
I mean, I don’t think any deployed system uses text diffusion to generate text; all the big models you’re interacting with are autoregressive. But on the research side there are papers and people working on this, and they claim that, at least in principle, you can get the same quality as the autoregressive approach, maybe at the cost of a lot more compute. Then the question: can this be connected to explainable AI? The fact that we learn good abstractions, can that lead to a better understanding of the structure of the model? At a high level, yes. If you learn abstractions, that is one way to bootstrap your explanations, because if the model really learns abstractions that are semantically meaningful for us, then it becomes easy to look inside the model and say: in this layer, the model is extracting this abstraction, now it’s representing objects, now it’s reasoning about objects, and now it’s taking a decision. To a large extent, that’s what things like saliency maps are trying to do: identify what the abstractions are, what the model is really doing at this unit in this layer. And again, I’m not an expert on explainability, but I think a lot of the literature relies on this kind of thing: a lot of it is about assigning semantics to different parts of the model, either to individual neurons or to projections of them, with various tricks to make it less noisy or more robust and so forth. So at a high level, I think it’s true. The only caveat I keep bringing up on the explainability side is this: the model does build some kind of abstraction. Even from that picture of folding the space, you can think of the folding as abstracting the space, compressing it, putting things together. But for those abstractions, what I’ve always been more suspicious about is to what extent the semantics we attach to them actually hold. And this is because, as I said, you have adversarial examples and all of this stuff. For us, when we build an abstraction, it’s usually very well encapsulated: most of the time it means one thing and only one thing. For neural networks, the support of a unit ends up being somehow unbounded: you see that this unit responds to faces, but it also responds to some weird noise pattern, and also to other stuff, and it’s really hard for us to put those things together into one concept. Still, the way these explanations are typically used, you attach some semantics and you walk through the model with them, and yes, there might be adversarial ways of breaking the explanation, but it still kind of makes sense, and it can still guide someone to understand roughly what’s going on. And for a fixed input and output, it is actually correct at some level: that is what the model did on that input. The issue with these things is usually not “can I explain why the model made a decision for this particular input”; the problem is “can I claim that it’s going to do the same for other, similar images”. That’s where things usually break down.
Can I generalise this and say that the model will always behave this way? That’s where it gets trickier. But you can definitely read these abstractions out and say, look, what this unit is doing is looking for this. Then there was a question about RL: when we talked about RL and these kinds of algorithms, the idea is that the model learns a representation of the environment; first it learns, say, the walls, the obstacles, the layout of the world, and then the policy learns that it has to avoid the wall here, it has to move there. So, if you use JEPA within RL, JEPA itself just focuses on the representation learning; it’s just going to learn, as you said, something about the environment. If you apply it to vision, you typically get a similar structure to what we usually see: the first layer learns Gabor-like filters, then you learn some kind of parts of objects, and from the final representation you should, for example, be able to linearly classify different objects. So somewhere in that representation it has encoded: this is a wall, this is an enemy, or whatever you have in the game, this is this and that. It should be represented somehow. The representation can be distributed, so it doesn’t have to be a single unit that fires whenever it sees a wall; it could be multiple units firing in a particular pattern, as long as the result is linearly separable. That is usually what JEPA is trying to do: make representations that are linearly separable, so you can use a linear layer to decode these things and do whatever you need. And that’s all JEPA does. It doesn’t have the other components; it’s just learning the representation. On top of it, if you want, you can put a linear layer and learn your policy, if you’re doing RL, and the policy would be something like avoid the walls, pick up the items, and so on. Then the follow-up: but what if the environment is partially observable, so I only ever have a partial view and never see the whole layout? Yes, so if you just run JEPA on the partially observed input, you will only learn to reason about what you see at that point in time. What you really need there is a system that has some form of memory, because you need to integrate information over time, across multiple partial views, to make sense of where you are in the game. JEPA on its own is not a recurrent model; it doesn’t have memory; it’s just your typical convnet-style feedforward thing, you just go forward. So for partially observable environments you would need to add, on top of that representation, a Transformer or an RNN or something with temporal extent that can integrate information over time. Otherwise you can’t do it. Well, there is one caveat that I know of, because I was playing with this at some point: trying to solve mazes where the environment is partially observable, so you only have a partial view of the maze, you can only see a few squares around you.
And I felt there was no way of solving a maze unless you put together all the partial views, build a sense of what the maze is, and then find a path through it. But if you run a memoryless agent on this, it works perfectly well, without any memory, and it can solve the maze. And the reason is that there is a simple solution, if you happen to know it, which is: always go right. Whenever you hit an intersection, you always turn the same way; if you’re in a maze and you always do that, you will eventually find the exit. So sometimes a partially observable task looks hard, but there might be a suboptimal yet good-enough solution that doesn’t require understanding the environment at all. You need to watch out for that; but otherwise, yes, you need memory. Then: so for JEPA, you mean the algorithm itself does not need memory? No, the algorithm is simply not meant to deal with the partial observability problem. Why? Well, doesn’t JEPA also see new data, and use the past knowledge it already has about images to predict? I guess we’re talking about different kinds of memory here. What I’m talking about is really partial observability: the agent never sees the environment in its entirety, only glimpses of it as it moves around, so it needs to put all of those glimpses together in order to know where it is and what it’s doing. What you’re describing is almost memory over the training set, and in that sense, yes, JEPA is compressing the entire training set; it has seen the whole training set. What JEPA is not going to be good at is when you need memory at inference time: if instead of seeing a whole image I can only see a little square at a time, and I have to move my eyes around to see everything, then I can apply JEPA independently to each of these glimpses and it will analyse each one and tell me what’s in it. But that doesn’t mean it can put the pieces together; I need something else to integrate all of these little pieces. That’s just not the task it’s trying to solve; it only gives you an answer about the particular thing you’re looking at right now. And I’m just saying that if you’re thinking about an agent in an RL setting, or about an overall AI system, that full system needs memory, because it literally has to integrate over time, search over things, and so on. But JEPA is not meant to do that; that’s not its job. Okay. What’s the time? Let me go back to the slides; let’s do five more minutes. Okay, so I already mentioned some of this. Going back to the problem: people approach it from the architectural side and from the learning side; there is no proper solution in the field at the moment, just ideas, and people attack it from all of these directions. Then a question: couldn’t you deal with some of this, for example, with elastic weight consolidation, EWC?
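For context on what EWC refers to: elastic weight consolidation is a regularization-style continual-learning method; after learning on earlier data you estimate how important each parameter was, typically via a diagonal Fisher information estimate, and then penalize moving the important ones while learning the new data. A minimal sketch, assuming a generic PyTorch model and a precomputed `fisher` dictionary:

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=1.0):
    """Quadratic penalty keeping parameters close to their earlier-task values,
    weighted by a (diagonal) Fisher-information estimate of their importance."""
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam / 2.0 * loss

# Training on the new task then looks like:
#   total_loss = new_task_loss(model, batch) + ewc_penalty(model, old_params, fisher)
# where `old_params` are snapshots of the parameters after the earlier task and
# `fisher` is estimated from squared gradients of the earlier task's log-likelihood.
```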
No, the main distinction is that regularization methods like that only go so far; it’s a parallel line of work to the learning-dynamics view, where the issue is in how we learn from the stream. But yes, both exist: I prefer looking at it from the learning-dynamics, algorithmic side, but there is plenty of work that looks at it purely from the architectural side and tries to solve it that way. Okay, I’m not sure how far we’ll get. So, to recap: we started with a bit of motivation. We said this matters because these systems are everywhere now, and a static system is not good enough anymore, because things change all the time around us. Which means that the way this is handled at the moment is this regular retraining and redeployment of the systems, right? Every so often your ChatGPT, Gemini, whatever, needs to be redeployed, not necessarily because something improved on the architecture side, but just because it needs to keep up with the world. If we still had the chatbot from five years ago, it would not know about anything that has happened since then. Question: is it retrained from scratch every time? Yes, pretty much; something is always training. Every system is different and my numbers might be off, but it usually takes anywhere between four and six months to train one of these big models from scratch, and that happens on a huge cluster; huge is an understatement. So they do the pretraining run, then after that there are a few months of post-training, where they do all kinds of things, and then they deploy it. So far there have always been additional changes along the way too: making the models a bit bigger, changing something here or there that got figured out in the meanwhile. But even when people are not explicit about it, part of the reason is also that the data set gets updated, and in particular it gets updated with the new things that have happened in the meantime. Question: so if we were able to make the model adapt instead, to make use of its pre-existing knowledge, that would save a lot of money, right? Yes, that is the premise of the field, of continual learning: if we could make this work, it would save a lot of money. There is a sense in which, to be fair, the big companies are in an interesting position here: the fact that it is so expensive to train is a moat they are exploiting. Because if you didn’t need to train from scratch, a lot more people could do it, and then there would be a lot more competition. So there is this moat, the fact that you need so much compute, which in some sense is comforting for some of these companies. If you wanted to do a startup now and compete on this particular front, you would need lots of money just because you need lots of compute. Even assuming you don’t want to innovate, that you have the recipe and it’s just about implementing it on chips you rent somewhere, you still need a lot of money; there’s no way around it, you need a lot of compute for training. Question: but then there’s still a limit to how well even the big companies can do within those constraints, right? Yes, yes. And there are lots of things people don’t like about the current way of working; for example, I think they call it a hero run.
So you have the hero run, which is this four- or six-month, whatever, retraining of the model. You can’t afford to do multiple of these; you can do only one. So you cannot really hyperparameter-tune. If something goes wrong, that’s it: for that year, you’re going to be left behind. If you do your run and it fails, and some competitor does their run and it works, people will see that when the next iteration of the models comes out, and they’ll switch to whichever one is behaving better. So there is a lot of pressure on this one particular run having to go well. And obviously, I think the people working on it are not exactly happy about that; they would like to have the wiggle room to say, I want to try this, and if it doesn’t work out, I can rerun it and do something else. So it’s not necessarily a comfortable place to be, I would say. It does create this artificial moat, where only a few people can do this; but at the same time, even for those few, if you were the only one producing this, the only company, it would be fine, but because there are multiple companies competing, it’s not a very comfortable place. So that’s why people would like to have either more efficient learning or, I mean, it’s not only about continual learning: some people would argue there are other ways to make this cheaper. But they would like to have an alternative. It’s a very good problem. Question: wasn’t there the DeepSeek moment, where training came out much cheaper? It’s not that cheap. Yes, cheaper, but still not that cheap, if you actually try to train a DeepSeek-style model. And, okay, it’s also about scale: I don’t remember exactly what scale DeepSeek went for, but if you try to train in the range the big models typically are, hundreds of billions of parameters, it’s expensive no matter what you do. The DeepSeek recipe is not that different from the standard recipe. This is one thing that is, in some sense, a bit sad: if you start looking into what the different companies are doing, it’s almost the same. That’s why there is all this talk about the data mattering; it feels like maybe the data they use matters even more than the other details. Yes, DeepSeek had some tricks in there; I’m pretty sure all the companies are using those tricks by now. But the recipes are not that different, and how expensive they are is also not that different. The issue is really when you start scaling. There are startups and companies that work in the range of, I don’t know, a few billion to a few tens of billions of parameters, but that is really considered a small model, and these models are limited in terms of behaviour. If you interact with a model in that range, you will feel the difference from interacting with a model that has a few hundred billion parameters. That’s why people really push the scale. So really the problem is when you go to the extra-large sizes. And then, once you have the extra-large model, you start distilling it, making smaller versions. Obviously the model being served to you on, say, your phone is not going to be the few-hundred-billion-parameter one, because that would be far too expensive; what’s running on your phone is going to be a smaller model, I think on a phone you can’t really go beyond a few billion parameters.
If it’s running locally on your phone, it has to be below a few billion parameters, otherwise it’s just not going to run; it won’t be usable. So for a lot of these distilled models, the weights are a lot smaller, and many more people could train them. But a lot of the time the smaller models are also trained through distillation, or they are regularized with respect to the bigger model. So having the bigger model helps, because you end up with a much stronger small model. And it’s hard to use somebody else’s bigger model, because they will be able to tell if you’re querying their model enough to generate training data for yourself. People do try that, but it gets very expensive if you query it that much. But yeah, that’s roughly the scale of things and the trade-offs. All right. So, we talked about reasons: one reason is that the world keeps changing. The other reason is that simulating stationarity, shuffling everything into an i.i.d. data set, is expensive, both in data and in computation. One question that I think is particularly problematic is this question of coverage. If you say, I want to train once and then be able to interact with the entire world and know everything, then at the time when you train, at the point when you collect the data, you need coverage of all the possible cases the model might be asked about later. And that is not easy to guarantee. People do things like crawling the entire internet and so on, but there’s no guarantee that covers everything you want; and even if it covers everything, it might be completely imbalanced. There might be a lot more data about X than about Y, and you somehow end up not learning Y because of the imbalance. So how to collect the data, how to construct the data set to train these models, is not trivial by any means; it’s a big problem. And I believe the way it’s being done is mostly heuristic: there are recipes, different data sources, these data sources get different weights in training, and those weights have been obtained more or less empirically. If you push people to tell you why Wikipedia has this particular weight and that other source has that weight, they will not necessarily have a good reason. They’ll say, this is what works, this is our secret recipe, this is how we mix things: Wikipedia has this weight, web data has that weight, and then there are the other sources. And then there’s all the filtering that goes on, where you throw away data that is toxic or whatnot, and how you define those filters, how you detect whatever you want to filter out, is all heuristic and all part of the secret sauce. But it’s a really hard problem. If you end up in a world where you don’t have this much pressure on having comprehensive data about everything, that makes things much, much easier. And continual learning is kind of promising that. The other thing we talked about is performance and how we evaluate these systems. Again, at scale, evaluation becomes hard, because the test set is no longer an independent, unbiased sample with respect to the training set.
And you have this contamination going on, because it’s really hard to even know what’s in the training set, to know whether the things you put in your evaluation haven’t somehow leaked somewhere on the internet and been collected into your training set without you knowing; you have all of these kinds of issues. The other thing that’s hard is that we now start to talk about skills, and it’s not clear how you would measure whether a system has a skill or not. There’s also the question of human preferences, which are not easy to put into a number, right, whether one answer is better than another. But overall, on performance as a metric: the field has been obsessed for a long time with the idea of benchmarking, partly borrowed from computer vision because of ImageNet. For a long time, the way the field worked, you had a single number, and all you did was make that number go up or down, whichever direction is better: accuracy, perplexity, whatever, that was your number, and nothing else mattered. You could basically summarize your entire model by a single number, and all you knew was that this number has to go up or down. And the issue is that, as things evolve, you can’t summarize everything with a single number anymore. You can’t just say, I’m going to attach a number to my LLM and the only thing I care about is whether this number is higher than that other number. There are different axes along which you can evaluate, and you can’t just average them together; they live in different spaces, with different properties. And that’s where we are, right? Benchmarking is all about being able to build this table, where you have models, each one has a number, and you can see which one is better. And this benchmarking mindset, which is so ingrained, is just not working out anymore. The other thing I wanted to say about motivation, about why you’d want to move away: when you go down this road of i.i.d. learning at scale, you end up where we are right now, where you can afford to train only one big model. So then you’re forced into this one-size-fits-all kind of thinking. Right now the expectation is: my next model is going to be good at coding, good at this, good at that, good for this culture, for that culture, for everything. And this is not optimal in some sense; depending on what your goal is, one-size-fits-all doesn’t seem like the right way to go about it. Of course people do create specialized models for different things, but I think we are far away from having specialized models for individuals: me having my own personal model that adapts to how I interact with it, and so on. There is something like that to a certain degree, but it’s not a fundamental property of the model; the model underneath is always the same, because that’s how these things are trained. Okay, let’s do the break and then come back to this, because I think the next part is a bit more interesting and I don’t want to rush it. So let’s do a five-minute break and then we’ll come back.
[Break: informal conversation, mostly inaudible.] Before we restart: the second homework should be available now. If you have any feedback on the first one, let me know. If you have questions about the second one, ask; it’s not due immediately, I think it’s due next Wednesday, but I would like you to at least read through it, get a sense of it, and plan ahead. And don’t worry too much about it: as I said, for some of the questions there is no single right answer; you just need to give an answer and see where it takes you. At least think about it. Okay, so we talked about the motivations. Now I’ll jump into a new one, and I wanted to cover it because this is Rich Sutton’s framing of continual learning, and he has a lot of influence on the field, on RL in general; people call him the godfather of RL or whatnot. He’s the guy who wrote the book on RL, he’s done a lot of RL work, so he’s a very big name in the learning space. And he now has a research institute in Alberta that is largely about continual learning; they basically decided that this is the topic to worry about. The perspective, which I have some issues with, is the perspective that the world is bigger than the agent. That’s his framing. The way he puts it is: having an agent that can learn everything about the world is impossible, because the world will always be bigger than the agent. And one way he frames this is by saying that the world contains other agents, including yourself, so the world will always hold more bits of information than you can store, because you would need to store yourself within that. It’s this kind of philosophical argument. So his position is: the world is bigger than the agent, therefore you will never be able to match it, you will never be able to fit everything that’s in the world into the agent. The best you can do is learn things locally, around where you are.
So you keep learning locally, and then, as you move through the environment, as things change around you, you need to relearn. Because, and let me sharpen his point, even if the world were actually stationary, even if there were some fixed distribution that describes everything, so that in principle you could make things i.i.d., that distribution is so complex and so big that the best you can do is fit things locally; and then, as the world evolves around you, you keep relearning. So from the agent’s perspective, things always have to look non-stationary, because the agent does not have the capacity to represent everything it has seen in the past plus everything it is seeing now; it can only represent a small portion of it. So the only thing it can do is keep learning all the time. I don’t know if this way of putting it makes sense to you, but this is a framing that is very popular in RL; you’ll find this “the world is bigger than the agent” perspective in a lot of RL papers. And there is some interesting work from Kumar and colleagues, a Ben Van Roy-style paper; you can find it, it’s a very mathy paper. Ben Van Roy likes theory, and probabilities and statistics and Bayesian stuff, so expect all of those flavours. But it’s a theory paper where they show that, in the limit, any learning system has to perform continual learning. It relies on this “world is bigger than the agent” idea, takes a very information-theoretic point of view, and argues that there will always be more bits of information that you need to learn about, and there is no way around it. The details of how they do it, well, it’s a nice paper if you’re into this kind of thing, and I’d definitely recommend reading it, but it’s not crucial. The point I’m trying to make is that not only is this a perspective that’s very popular in RL, it even has some fairly strong theoretical underpinnings: there is this line of work, the Van Roy work and others, that really tries to formalize it from an information-theoretic point of view and show that, under some assumptions, this always has to happen. So from the learning system’s perspective, things will always look non-stationary, even if the world itself is actually stationary, because the lack of capacity of the learning system translates into apparent non-stationarity. Okay, so now let me tell you what I don’t like about this. What I don’t like is that it’s one of those things that are true in the limit; the theory tells me that, in the limit, this is what’s going to happen. But in practice, this is not what you’re facing when you’re trying to deal with non-stationarity. There are a few things that are complicated here. First of all, we don’t know what the capacity of a learning system really is. We’ve talked about this: we are very bad judges of looking at an architecture and saying how many bits of information this thing can store.
And that’s why we’ve had surprises, where people believed that, say, a ResNet would not be able to memorize ImageNet, and then it turns out it can memorize ImageNet, and so forth. The models we use are really high capacity, and we don’t actually know how to measure that capacity or even estimate it well. And then, on the other side, if you look at the world: the world might be complicated, but acting optimally in the world doesn’t necessarily have to be. So this argument that the world also includes me and other agents, I don’t fully buy it, because, fine, but I don’t need to model everything that goes on in someone’s brain in order to be able to interact with that person. The policy might require a lot less information than the full world. Obviously, if you try to build a world model internally, that world model will be imperfect, but that is partially the point of a world model: it doesn’t need to be perfect. So basically, it is not clear to me to what extent being able to model everything about the world is even necessary for whatever policy you need. These are the two caveats that make me a bit more suspicious of the argument. And maybe the part I dislike most, I don’t know if I have another slide on it, maybe the thing that makes me most suspicious, is the effect this framing had on the RL world. What it did is this: continual learning used to be studied in very controlled settings, where you learn task one, then you move to task two, then to task three. There is a lot of criticism of these constructions as artificial, but they were very useful: you knew exactly when things became non-stationary, you knew you had learned A and now you switch to B. At some point, people started saying, you don’t need to do that, you just need a very complicated environment, and you put the agent to learn in that complicated environment. Question: does this idea have a relationship to entropy, from information theory? Yes, yes, that is exactly what the Van Roy line of work is importing. It’s really connected: you can take this perspective and then prove, under some assumptions, that this has to happen from an information-theoretic point of view. I forget exactly how the paper works, but if you open it up you’ll easily find the formulas; it’s basically the statement that the amount of information the agent would need to store is more than it can hold. Okay, so, the issue I had with this is also this tendency I’ve seen in the field, where people say: let’s just make the environment very complicated, put the agent in there, and then it has to learn continually, because, by this hypothesis, the environment is so complicated that it will look non-stationary from the agent’s point of view. And I didn’t like that, because we are really bad judges of these kinds of things, and I’ve seen it over and over again. For example, at DeepMind, at some point we switched to 3D environments, and people thought this was going to make the RL problem much harder than Atari. It turned out not to be much harder.
It's just that the computation is more expensive; whether the environment is 2D or 3D doesn't necessarily make the problem harder in any interesting way. So, basically, I think it's a very nice hypothesis, I think it's a very good grounding for the field, but it has some caveats. In practice, when you learn in a nonstationary setting, what you notice is that the issues you face are not that the model is too small; they have more to do with the learning dynamics. The way learning works, it breaks what you've learned before, and that has nothing to do with capacity; it has to do with the learning algorithm itself. That is basically my only worry about this. I think this is a very cool proof of existence of the continual learning problem. It does the math, the information theory, and it's kind of a proof that you cannot avoid the continual learning problem: at some point you'll have to deal with it, purely because of this property that the agent is smaller than the world. What I don't like about it is that the problems we see in practice when we try to train these systems have nothing to do with capacity. They have a lot to do with the learning algorithm and other aspects of the process, but not with capacity. And my worry is that when this came out and everyone was talking about it left and right, people ended up focusing on the wrong aspects, because they started worrying about capacity and how models could have more capacity, and that wasn't tackling the problem we were actually seeing in practice. So that's my gripe with it. But yeah, it is maybe one of the more established perspectives on continual learning. Just to continue: there are other reasons why you might care about continual learning; we were discussing something similar before. One feeling is that the current paradigm is unsustainable: these large systems we're training use a lot of energy, they have a huge impact on the environment, and it's also not equitable socially, since not everyone has access to these systems or can play with them. So there is a chance, and this is not a given either, that if you had adaptable systems things would be different: you would not need to retrain the system from scratch, you could alter it more meaningfully, in a more sustainable way. Another reason continual learning is important is that sometimes we need to unlearn. Actually, I won't talk that much about unlearning in this lecture. Unlearning is sometimes confused with continual learning, but it is a different thing. Continual learning is really about: I've learned something, now I'm learning something else, and I don't want to forget, or I want to improve on what I'm doing based on what I learned before. Unlearning is saying: well, I learned a bunch of things, and now I want to delete part of what I learned. And the methods tend to be questionable, yeah. I guess, as far as I've come across it, unlearning degrades the model's performance in general, because it affects more than the thing you're removing, right? So, I mean, it depends on the kind of unlearning algorithm being used and how you approach it. I think, as a field, it's pretty young.
And some of the methods I'm aware of in this space just feel like hacks. For example, as far as I know, one popular unlearning algorithm is to take an ascent step in the direction of the data you want to forget. So you have some data you want to forget, you compute the gradient on that data, and the gradient shows you the direction you would move in to learn that data better, so to say. And you just invert the sign of the gradient and move in that direction. The intuition is: okay, I want to forget, so I just walk backwards, right, and the model forgets it (there's a small sketch of this below). But if you think about it, this is a divergent process: it doesn't have to converge anywhere. And you can see this because people who do this have all kinds of heuristics about how many steps to take, when to stop, and things like that, because there is nothing like in normal training where you stop when you converge. I really don't like this notion of just ascending on the loss. Yeah, so methods in the unlearning space, I think the field is at its beginning and there is a lot of work to do. One thing I haven't seen much, and this came up in a discussion with a few of you yesterday, is the following: in continual learning we often identify the subset of parameters relevant to a task, for example by looking at the Fisher information and things like that, and then we project away any component of the gradient that would touch parts of the model we shouldn't touch. I haven't seen a lot of those kinds of methods in unlearning. So one thing you can imagine is that you want to make the unlearning signal, even if you're doing this ascent thing that I don't like, localised: you want to say it's not going to be applied to the whole model. As far as I know, people don't do that; they just compute the gradient on everything. But yeah, in general, I think the truth of it is that unlearning is a very young field, so the methods we have right now don't work very well. So if you want to jump into a field, this is probably still a good time to jump into this one, because it's really at the beginning; almost anything goes. There is also a distinction within unlearning that is a bit harder to pin down. For example, there is unlearning in the sense of: there is a specific item in my dataset and I want it removed. This is more on the memorisation side: okay, there's a fact the model memorised and I just want to remove this fact. And then there is: I want to unlearn a particular skill, or a particular concept, something a lot more vague. These are very different questions, and you'll probably need very different algorithms and very different ways of approaching them. So there are all these nuances that are useful to know about. The other thing, and you've probably noticed this in my slides, is that in continual learning a lot of the literature is on systems without memory, like MLPs and so forth. The unlearning literature, because it's newer, is focusing a lot more on Transformers. I might talk about this again.
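To make the two ideas just mentioned concrete, here is a minimal sketch, in PyTorch, of the ascent-style unlearning step, plus the kind of Fisher-based masking that could localise it. The function names, the diagonal Fisher estimate, and the quantile-based threshold are my own choices for illustration; this is not a reference implementation of any published method.

```python
import torch

def fisher_diagonal(model, loss_fn, retain_loader):
    """Rough diagonal Fisher estimate: average squared gradient on data we want to keep."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in retain_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(retain_loader), 1) for n, f in fisher.items()}

def ascent_unlearning_step(model, loss_fn, forget_batch, lr=1e-3,
                           fisher=None, keep_quantile=0.9):
    """One 'walk backwards' step on the data we want to forget.

    If a Fisher estimate is given, parameters in the top quantile of importance
    for the retained data are left untouched, so the update is localised
    instead of hitting the whole model.
    """
    x, y = forget_batch
    model.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for n, p in model.named_parameters():
            if p.grad is None:
                continue
            step = lr * p.grad  # gradient *ascent*: increase the loss on the forget data
            if fisher is not None:
                thresh = torch.quantile(fisher[n].flatten(), keep_quantile)
                step = step * (fisher[n] <= thresh)  # zero out updates to "protected" weights
            p += step
```

In practice you would loop this step some heuristic number of times while monitoring accuracy on both the forget set and a retain set, which is exactly the kind of ad hoc stopping criterion being criticised above: nothing in the procedure itself tells you when you are done.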
When you have memory, continual learning becomes a lot more complicated, because now you have multiple places where information can be stored, and you might also have multiple kinds of learning mechanisms happening in parallel. One mechanism is gradient descent, and the weights are the place where you store information. The other is the memory of the system, and that can be somewhat explicit, as it is in the context of a Transformer, or it can be more implicit, like some additional memory module of whatever kind. And you have meta-learning effects, something like that, happening as well. And it is considerably more complicated how all of these systems interact with each other. And just to put it out there: I don't think there is enough work on this. Basically, we haven't yet thought deeply enough about how all of these things interact. I strongly believe there are a lot of interesting questions you can ask by taking any of the standard questions that have been asked about unlearning or continual learning and asking how the memory component of the system interacts with them, and how these different learning timescales interact, right? Because the truth is, in a Transformer, in any system with memory, you have at least a couple of kinds of learning: one is what happens at inference, the in-context behaviour, and the other is what you do with gradient descent. These are different timescales, and they interact in very interesting ways that haven't been studied enough, I think, in this space. In terms of reasons to be interested in continual learning, I think the last one is maybe just that you care about biological systems. Biological systems do not learn the way our current systems are trained, at all. If you, as a human, tried to learn the way ChatGPT does, you would completely fail. Imagine taking a book, shuffling the pages around, and then starting to read randomly: you're not going to get anything other than garbage, right? So if your goal here is to understand biological systems, if your goal is to understand how the brain works, which used to be Geoff Hinton's catchphrase, then what we're doing with neural networks right now is not a good model of that. Even that alone would be a sufficient reason to want to figure out continual learning: continual learning would require systems to learn more the way humans do. So, what would it imply if we switched to continual learning? This is the bulk of my motivation: it's about the way we train these things. But continual learning is not necessarily an answer to all of the problems. It's easy to complain about the existing paradigm and say it gets this wrong and that wrong and so forth; criticising is not the same as doing better. And just because we have this alternative paradigm, continual learning, it doesn't mean it's going to fix everything. It might still end up being extremely expensive, it might still end up being just as unsustainable, it might still end up being quite different from what the brain does and just as uninformative about how the brain works.
But what I'm trying to argue is that, at the very least, it is a genuinely different perspective, where instead of focusing so much on the performance of the system, you care more about how quickly the system can adapt to a change in distribution. And I think just that might enable us to figure something out. It might not solve all the problems, but my claim is that, typically, progress is made through this kind of change of perspective: you start measuring a different metric than you were measuring before, and then certain things become a lot simpler, certain things become a lot more obvious, and you get to make progress. So the whole point of continual learning is that we're moving away from "I want to get performance X on this task" towards "I want a system that will quickly adapt to something new that comes in, by relying on what it has learned so far, and learn the new thing in relation to that." So I wanted to, and this will be a more philosophical point, but I'll try to keep it short, I wanted to discuss a little bit why this matters: why, when you switch to continual learning and try to reframe things, you de-emphasize raw performance and change the perspective. This slide is about Thomas Griffiths. He's a cognitive scientist; I'm not completely sure where he's based, but anyway, he's a big name in that space. He had this paper called Understanding Human Intelligence through Human Limitations that I really like. Well, I haven't read the whole thing carefully; I read through it a long time ago, so what I'm telling you might not be exactly what's in the paper, but this is what I took away from it. His point is that the way human intelligence looks, the way we behave and the way we do things, is a consequence of the limitations that the brain has, and we should think of those limitations... let me reframe it this way: he's arguing that these limitations act as a regulariser. His point is that if you don't have these limitations, you will never find a certain set of solutions, because you don't have the right regularisation in your learning system, so you're not going to end up in the same place. And he has an example that stuck with me. He was talking about AlphaGo. At some point, at DeepMind, we were building the AlphaGo system that played the game of Go. Many of you might know what Go is; it's like chess, but much more complicated. I've never played it in my life, but it has this reputation, right? The complexity of the game is much, much higher than chess; it's a very complicated game. Anyway, the system was built, and there were these matches against Lee Sedol, one of the best players in the world; AI was not supposed to reach that level at Go so soon. And then AlphaGo won, and it was a big moment for DeepMind. And during one of those games, the Go agent made a move, I think it was move 37, I don't remember exactly, one of these famous moves, because it was all over the news and everyone was excited. And the point of that move was that all the analysts watching the game couldn't make sense of it. They were saying this move looks like a mistake; it makes no sense to play this move.
But then it turned out that that was the move that made the agent win against Lee Sedol, because the move turned out to be extremely important much, much later in the game, right? So it was a move early on that everyone saw as a mistake, but it turned out to be pivotal towards the end, and it decided the game. And everyone was calling this move alien-like: there's nothing human about this move, no human would ever play it, it made no sense. So what Thomas Griffiths is trying to do is explain how this might have come about. His point is that when humans try to solve problems, the standard way we do it is decomposition; we have this divide-and-conquer approach to solving any problem. So when you try to learn how to play Go, I assume it's the same: we find intermediate goals, like, okay, secure this area, do this, then that. And the way we approach planning is through this: you split the problem into simpler ones, the simpler ones you can understand, and from those you compose a solution. The AlphaGo agent is not compositional. It never tries to decompose the problem into subproblems or anything like that; it just tries to solve the whole problem, from the beginning to the end. It just does a huge search and tries to solve the problem. So his point is that the move looks alien-like because it doesn't serve any subgoal. There is no decomposition of the problem under which this move is a step towards something; it's only a step towards winning the game at the end. And if you need a decomposition in order to understand it, you're not going to find one. That's his high-level explanation: humans find that move alien because we cannot solve the problem in one go, we have to decompose it, and there's no decomposition for which this move makes sense. The reason I find this explanation interesting is that it points to something the field has struggled with for a long time, which is that for some of these properties, like compositionality, you need to pay a price. Another way of framing what Thomas Griffiths is trying to say here is that humans will probably always be suboptimal at Go, because they have to decompose the game into subproblems, while there are strategies for winning the game that don't involve any decomposition, and those are simply out of our reach. We cannot execute those strategies, because we need the decomposition to make sense of the game. So the point, and I'm going a very long way around it, is that compositionality, which I think is a very nice property for reasoning, does not necessarily equal optimality. There might be ways of solving certain reasoning problems that are out of reach precisely because they are not compositional. Therefore, when we build systems, if we want a compositional system, we should probably expect a drop in performance, because there is a price to be paid for it. Compositionality has benefits; one particular benefit is that it allows you to adapt very fast, because whenever you face a new problem, you're just recomposing it from prior solutions.
So that's why it's so fundamental to humans. Compositionality is potentially one of the main mechanisms by which we achieve very fast adaptation to new environments: we decompose things into pieces we already know. But it might come at a price, right? So maybe we're very interested in adapting, but we will probably never perform as well as a system that is not compositional, that is only doing that one thing and has just learned to do that one thing. Which, I mean, is not surprising; we see this all the time, right? When you take a machine learning system that is trained on exactly one thing, like only playing chess, it's always going to be better than any human. And part of the reason is that that agent does not need to be adaptable; it doesn't need to have all these other side properties that a human needs in order to survive in the world, where you can't just do one thing, you need to be able to interact with other humans and all of that, right? So the point I want to bring out of this paper, and there are multiple points in it, is that I think this has been a recurring issue in RL for a long time, and it's an issue for continual learning as well. What we end up doing in papers is the following. Take hierarchical RL, HRL, for example. You come up with an HRL system; HRL basically imposes a compositional structure, right? You're doing reinforcement learning, but you assume your policy is compositional, that it has certain levels that interact with each other. So you build your system, and then you compare it against a baseline that is not hierarchical, trained on the same single task, and the flat baseline always does better. And I think this is the reason why HRL has, in my view, largely been a failure: no one has managed to get HRL to really work. And the reason no one got it to work is probably that it will never work as well as the non-hierarchical system on a single task. It's not going to work as well as a non-compositional system, because there is a price to pay. Now, where it becomes tricky, particularly for papers and research, is: what is an acceptable price to pay? If I come up with a system and argue, okay, my system is compositional, my system can learn continually, and I accept that it has to perform worse: what is an acceptable "worse"? I don't know. If my original model was getting 78%, and my system only gets 60% accuracy, is that good enough, given that it can also adapt fast? That's a really hard case to make; it's really hard to get a paper published with that kind of argument. But I'm just saying that this field, in some sense because of this benchmarking philosophy that rules it, where you always have to compare to a baseline that doesn't do what you're doing, operates on the assumption that if what you're doing is good, then you have to perform at least as well, if not better. And the point here is that when what you're doing is adding extra structure that is meant to help you along an orthogonal axis, in particular compositionality, which should help you adapt quickly, then you might have to pay a price somewhere else. And that price might actually be bigger than we expect.
So this is, I guess, the message, and I thought it was an interesting one. It's a bit of a specific aside, but I think it's an interesting perspective because it highlights why the field has struggled so much with these kinds of things. Compositionality is not really established in machine learning; we don't really have strong compositional systems that can build things up. And it's not because people haven't tried. People have tried really hard, and in cognitive science it's a big theme: if you talk to anyone there, this would be one thing they would want to see from an AI system, that it is somehow compositional in how it does things. It's just that we never got such systems to work well enough, and this paper is saying that maybe we're looking for the wrong thing, that we don't know how to calibrate ourselves when we're looking for this kind of behaviour. So that's one takeaway message I got from the paper, and it's a bit of a side point, admittedly. The other thing, which I mentioned at the beginning, and which is the main thesis of the paper, is this idea that the limitations a system has should not be seen as something you're supposed to fight. You should see them as a regulariser; they're actually helping you. His point, for example, is about limited inference compute: when you have to give an answer, you can only do so many operations, the brain is limited, and so on, right? So we have limited compute for training, limited compute for inference, a bound on how many calculations you can do before you have to answer. And his argument is that this is what led to compositionality: because of these constraints, when you try to optimise the system, the only solutions that do well are the ones that decompose. If you remove these limits, and that's the issue, right? Even if you want compositionality to emerge, you remove the limits, because you train on a cluster with a huge amount of compute, and at inference you use as much compute as you can and push everything through in parallel. If you remove these limits, then the compositional solution will not be in your solution set anymore, or it's not going to be the optimal one. So his claim is that if we want AI agents that behave in a more human-like way, we should train them under constraints similar to the ones humans have, in terms of the compute they have access to and in terms of how they interact with the environment and so on. As long as we don't mimic these limitations, we just don't have the right regularisers, and we will never find the same kinds of solutions. That's his philosophical take, and generally I agree with it: you should think of limitations as something you can exploit to shape the loss surface, not as something you need to work around or remove from the system because they're just harming you. I pretty much understand, but I have a question: for example, you have AlphaZero, and with so much training it ends up using the same openings as humans sometimes. How does this fit with compositionality? Because those openings are certainly something very understandable.
I mean, yeah, that's a very good point. The way I would read it is: well, first of all, of the two systems, AlphaGo, as opposed to AlphaZero, was definitely trained on human data, so there it could just be the imitation process: it has seen these openings many, many times and it is imitating them, not necessarily because it is trying to decompose the problem. For AlphaZero I'd have to check; I do remember it ends up rediscovering some of the human opening patterns. So maybe this argument doesn't help as much there. I guess the point is that the moves that serve a role in decomposing the problem can also be genuinely good moves. The perspective here is not that the moves we make when we decompose are always bad; it's that sometimes there are other moves that you're simply not going to have access to. Yeah, it's not a watertight argument. I mean, I think in a lot of scenarios, and I've seen this even in more controlled settings, you can look at algorithms, right, sorting and searching and whatnot, and in a lot of instances, for a lot of problems, divide and conquer is provably optimal: you can mathematically prove a lower bound on how fast anyone can sort, and that these algorithms achieve it, and a lot of these algorithms have exactly this structure of decomposing the problem into subproblems. So it's not that compositionality is always suboptimal; it's that in many instances it might be suboptimal, and we don't always know when that's the case. But obviously, okay, I presented this as ground truth; it's just a position, just a take, basically my interpretation of Thomas Griffiths' paper. It doesn't have to be true, so it's perfectly fine for people not to buy this particular argument. It's an argument that I find compelling, but this is the hard part with this kind of stuff: it's really hard to prove either way. I don't have a way of proving that compositional structures in StarCraft or wherever always have to be suboptimal, and that there is always a non-compositional solution that beats them; I can't really conclude that. But I suspect that, for example, with AlphaStar, the people inside that team found the same thing: they were trying to build a hierarchical RL system, which would give you a compositional controller, and it never worked. The agent that worked best was not the hierarchical one; it was something that did a lot of imitation learning: just look at human players, try to imitate exactly what they do, and then fine-tune with RL on top of that. None of the hierarchical stuff ever worked as well. And I remember at the time it was a real disappointment for a lot of people at DeepMind, because they thought: there is no way, when you take a game like StarCraft, and I don't know if you know StarCraft, but it's relatively complex...
There's no way you can solve this game unless you build all of this structure, unless you have this kind of compositional behaviour where first I build my base and collect minerals, then I build an army, then I do this and that. There's no way brute force can do it; brute force is not going to work here, you need some form of hierarchy, some form of decomposition. But AlphaStar doesn't have any of that. It has a bunch of imitation learning and then some training on rewards, right? And there are some tricks in the paper, I'm making this sound simpler than it is, there is some nice stuff in there, but the big hopes people had for hierarchy did not pan out; the same as with AlphaGo, it turned out that the brute-force recipe works really well. And I think on the back of this you get the bigger lesson people drew from AlphaStar and that line of work: oh, you don't need anything, you just need scale and it's going to work. And that's probably the right read of AlphaStar and some other things that happened in that space. But I think the reason the bitter lesson works is that we are only looking at performance, in some sense. You can interpret it this way, at least: obviously, if we only care about performance in-domain, you don't need anything except scale and data, and you throw compute at whatever you have; so far, empirically, that is what has worked best, and it will probably keep working. But if you care about something else, then this might not work anymore. For example, if you care about out-of-distribution generalisation, or about some symmetry you want respected, whatever that is, then just throwing more data at it is not going to fix it. Let me give you an example I like. Say you want to learn to sort lists of at most length K, and then you want to generalise to longer lists. You can make the model as big as you want, you can put in all the possible lists up to length K, and it is still not going to generalise to, like, 10K. It's a kind of inductive bias that you're not going to capture: it will work perfectly in-domain, it will be able to sort any list up to length K, but it's not going to generalise; you're always going to find a length beyond which the system breaks, no matter how deep you make it, no matter how much training you do, no matter what you do. I mean, obviously I haven't run that exact experiment, but I'm fairly confident that's what would happen. And that's because at that point we're not looking at in-domain generalisation anymore, the kind of generalisation we were talking about before; we're looking at something else, at extrapolation behaviour. And that extrapolation, I don't believe, is going to be solved by scale. This is my own take: the bitter lesson, if you're familiar with that blog post from Rich Sutton, is only true if you stay in-domain. And I think that's a mistake a lot of people make. And, as a side note, this becomes even more painful when you talk to RL folks. I had the opportunity at some point to talk to Rich Sutton, and one thing I found very surprising is that he says he doesn't understand this separation between train and test. For him, everything is training.
Like, there is no test set, there is no separate thing, just one stream, one environment. So that makes it even harder to discuss these kinds of things with him, because he doesn't even believe in this separation; for him there is only training. Trying to convince him that there's even such a thing as out-of-distribution generalisation is harder still, because for him there's only one thing: you're just learning, on a stream, and that's it, and you have access to all of it. There is no test, because in RL, classically, what you train on is what you are evaluated on; there is only one environment. It's not like you have a training environment and a validation environment and you train on one and evaluate on the other; there isn't that separation. That said, of course, he was partly trying to be provocative. I don't know if you've ever met him or had a chance to talk to him, but he is a very trollish kind of person. I'm pretty sure he knew what I was talking about, but he was just trying to get me annoyed. He really likes to get people annoyed, and he likes digging into a position, even an absurd one, and defending it with all kinds of technically correct arguments that in practice don't mean much. He does a lot of that. Let me check the time. We have several more minutes; let's see how far we can get. I spent a lot of time on the previous slides. So this slide, actually, it's not as important as the previous one. What this slide is saying is that one way to think about this problem, one way to move away from raw performance, would be to have some kind of magic metric, and I wrote "magic" deliberately, this metric doesn't quite make sense, but: if we had a way of not just looking at the performance, but weighting the performance by how much it took us to learn, how expensive the inference process is, all of these things, these are the kinds of constraints a human brain has, right? We are bounded. I mean, for humans, is that more of a hard constraint or a soft constraint? Yeah, it's more of a soft constraint, right. But basically, what I'm trying to say here is that, if we were to make this into a single number, we want a number that takes into account all these additional costs that matter: for example, how much high-quality knowledge is already in the system versus how much it has to suffer through data; how long it takes to learn; how long it takes to do inference, to make a prediction; what the data even allows us to learn, how much information is actually in the data. The idea is that if you had an objective like this, the claim would be that you would find a very large number of architectures that are much, much better suited to the continual learning problem and that have very different kinds of properties. And I put this up as an aspirational target that people in the field might adopt. But obviously, this formula is not well defined. We don't know how to measure any of these quantities; they might need to be weighted in particular ways, and it's not clear to me that the metric should be proportional to this and inversely proportional to that and so forth. I just put it there to show where the issue is.
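Just to give the shape of such a "magic metric" in symbols, here is one way it could be written down; the particular form and the terms are mine, purely illustrative, since, as said above, nobody knows how to measure these quantities or how they should be combined:

$$ \text{score} \;=\; \frac{\text{performance on the new task}}{\text{data required} \;\times\; \text{training compute} \;\times\; \text{inference cost per prediction}} $$

with the understanding that each term would itself need a careful definition (for instance, "data required" should probably be measured relative to how much relevant knowledge the system already had). The claim in the slide is only that optimising something of this general shape, rather than raw performance alone, would favour very different architectures.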
Right now we just look at performance, and we're not correcting for all of these other factors that we might care about. That's what this slide was meant to say. So it's a bit hand-wavy; it's trying to claim that part of the reason we are where we are, with algorithms that apparently work reasonably well but are limited in all kinds of ways, is that we don't properly account for these other side properties. A word on terminology, by the way. I keep saying continual learning; that's the term I like. The field has actually been around for a while: there are papers about it from the early 2000s, maybe even the 90s, some PhD theses here and there and a few papers, but there wasn't much talk about it. Then it kind of exploded around 2017 or so, and that's when it became a proper field. But the issue is that it was reinvented in different places independently, so you have continual learning, lifelong learning, never-ending learning, open-ended learning, and at some point people don't actually know what the differences between all of these are. So you have something like seven different names; many times they mean the same thing, sometimes they mean completely different things. So, at least for a while, it was a very messy field. I think nowadays it is consolidating more and more around "continual learning". It actually helped a lot that the RL folks started calling it continual learning, because they had been calling it lifelong learning and similar names, and when they joined the deep learning community there was always this confusion of these people calling it one thing and those people calling it another. But anyway, it's a field that evolved organically, and it's also a field that doesn't have a very strong core definition. If I take other topics, there is usually some formalism and some framework, some book written by some big name somewhere that sets out what the field is about; there are definitions, and everything else follows from them. Continual learning is a lot more fluid. Different people will give you very different definitions; those definitions will sometimes be contradictory, they will fight each other, and there is no ground truth, no established authority that says: okay, this is what continual learning should be and this is what we should do. So that's one aspect of it. The other aspect I wanted to mention is that, as you start playing with these systems, I am fairly convinced at this point that there is not only one continual learning problem. People usually talk as if there is the continual learning problem and there will be the solution to it, but it's more like a spectrum of problems that will require fairly different solutions, and I'm pretty sure those solutions will have to differ from one another. I'm just flagging that so you know where all this messiness is coming from. But if you want definitions, here are two possible ones. One is the simpler one: you just say continual learning is whenever you don't make the IID assumption.
So whenever you don't assume the data comes from a single stationary distribution, whatever you do then, that's continual learning, right? A pretty wide definition. There are going to be things under this umbrella that, I'm pretty sure, some researchers would say are not continual learning. For example, the prequential learning line of work, right? It doesn't make the IID assumption, so it falls under this definition, even though I'm pretty sure the people working on it wouldn't call it continual learning, and it has nothing to do with RL either. But that's kind of the point: it's a very wide definition. Another way of defining it, one that is closer to what I care about and to what I'm going to show you in these slides, is that continual learning has to do with the credit assignment mechanism within the neural network, and with trying to fix that credit assignment mechanism. Maybe I should explain what I mean by credit assignment. The credit assignment mechanism, what I mean by that, is this: when you're learning, you have a network, you observe some error, and then you have to decide which weights are to blame for that error, and you have to correct those weights. That is the credit assignment problem. And usually we solve it with gradient descent: the way we decide which weights to change, and by how much, is determined by gradient descent. So one of my personal takes on continual learning is that it is really about understanding how gradient descent does this credit assignment, and where it goes wrong. But obviously this is a very narrow definition. There are a lot of things people call continual learning, for example all the memory-management stuff, that will not fall under it, right? Because this definition, to state it cleanly, is really about gradient descent and how gradient descent works. And obviously there will be lots of people, like the ones we were talking about before, who would disagree and say: no, this work with memory is also continual learning. And I actually agree with that. Sorry, this is important, but yes, we're out of time, so I think we'll stop here. The lecture was moved to the afternoon, so I think we have another session later, correct?
LECTURE 8:
Yeah, I don't usually teach, so I was struggling a bit to figure out exactly how to structure things. My goal was to try to highlight what I think are the fundamental aspects of continual learning, the things that, if you understand them, will help you navigate the literature, which is very rich, rather than every detail. That obviously means I'm trying to cover a lot of ground, and the coverage is a bit uneven in places. So, yeah, I can appreciate that, and I just want people to feel a bit more comfortable. Even the homeworks: I know they look scary, but I'm not going to be very strict when marking. I just want to see that you're putting in an effort and trying to understand, to a certain degree, the kinds of things I'm talking about. So don't stress too much, and I'm not going to expect you to do anything magical with the data. I will try to have a test next week; I actually haven't written it yet, but I will, and it won't be a long test. It will count a little towards your grade, but it's mostly so you get a sense of what I expect and how I think, so you know what to prepare for when you have the final project, which is going to be most of your grade; I think we said the project is most of the mark. But the goal here is really that, hopefully, you'll build some intuition about these systems, so that you know how to reason about them and how to navigate the field. The details usually don't matter that much, but if you have the right kind of thinking behind it, everything becomes so much easier. In the future, if you decide that what you want to do is, for example, a PhD, and I don't know if you have this discussion of what a PhD is at the master's level, but if you do decide you want to do a PhD, you'll already have a sense of what is covered here, you'll kind of know what can happen, right? The other useful thing is getting a feeling for what a good technical problem for a PhD looks like. So I'm not going to make the test super hard, partly because, for the homework, you have time to think about it, whereas for a test you don't. Okay. That's what I wanted to say. Before I continue: any questions, anything I should go back to in the slides, anything you want to bring up before we jump into the material? No? Okay. So, where we stopped was discussing definitions of continual learning, and I was trying to make the point that there are many definitions, not all of them compatible with each other. There are even different names for the field: different popular names like lifelong learning, which is the messiness I was mentioning. There is a whole community centred around these topics; even the conference is called the Conference on Lifelong Learning Agents. So a lot of these terms are going to be used interchangeably. Oh, did the slide change? No? Okay. So, what this slide says is that, usually, the way continual learning is introduced in most papers is something like this; it's a way of giving you a picture of how people think about the problem.
So the idea here is that I have a system that is learning sequentially on some tasks, one after the other, and then I expect the system to behave in particular ways. One question is: okay, how do you formalise this non-IID data? Here it is done as a sequence of tasks, task one, task two, and so on. And then you have choices about how these task transitions happen: am I jumping abruptly from task one to task two, or is it a smooth transition where I slowly morph from task one into task two? Are there repetitions of the same task within the sequence, so I do something, do something else, and then come back to the first thing? How long do I stay in each task? These all act as hyperparameters that define the particular continual learning problem we want to solve; it's how we set up the problem. And they matter. That's part of why the field is so messy: if I have discrete task boundaries, so I do one thing and then switch, I will probably build a different kind of algorithm than if I have no task boundaries but a continuous drift in what I'm supposed to do, right? Imagine you have to classify images taken over time by an outdoor camera, and slowly, you know, you start seeing snow, then the snow disappears and you start seeing sun and so on. That's a smooth transition, right? The weather doesn't change instantly; well, sometimes it does, but usually it doesn't. So in that case you'd design a different kind of approach compared to a situation where the tasks are completely different, like the first task is MNIST and the second task is something else entirely. Usually people are not very explicit about this. The way it is usually handled is through benchmarks. Continual learning, like many other fields, grew and has existed so far in a benchmark-driven way: you have a bunch of established benchmarks, and those benchmarks were created with particular choices, about what the tasks are, whether they are supervised or unsupervised or RL, and what kinds of transitions they have. Okay. And then there is a list of desiderata, things we expect from a system that learns on this kind of non-IID data. One is that you don't have free access to the previous tasks: there is this notion that you move forward in time, and once I've seen task one and moved on to task two, I don't go back to it, right? I only see the new task. Another is fixed capacity and computation: this is not a system that magically grows over time; even though there are methods that do grow, ideally you want a bound. You say: okay, this is the agent, this is all it has, and somehow it has to deal with this sequence of tasks. And then you want things like avoiding catastrophic forgetting. What I mean by catastrophic forgetting is the empirically observed effect that if I learn task one, say MNIST, and then I switch to, say, CIFAR, the first thing you observe is that the system immediately forgets task one. As soon as you start learning something else, the first thing that happens is that you completely break whatever you learned before. This is the phenomenon people call catastrophic forgetting. And usually what you would want is a system that can learn a new task without forgetting the previous one.
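Here is a minimal sketch of what this benchmark-driven setup typically looks like in code; the task list, the model, and the training step are placeholders I'm introducing for illustration, not any specific benchmark:

```python
import torch

def evaluate(model, loader):
    """Placeholder accuracy evaluation on one task's test set."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=-1) == y).sum().item()
            total += y.numel()
    return correct / max(total, 1)

def run_sequence(model, train_step, tasks, epochs_per_task=1):
    """Train sequentially on a list of (train_loader, test_loader) tasks.

    Returns acc[i][j]: accuracy on task j measured right after finishing task i.
    From this matrix you can read off the usual quantities, e.g.
      forgetting on task j = max_i acc[i][j] - acc[-1][j]
      learning accuracy    = acc[j][j]
    """
    acc = []
    for i, (train_loader, _) in enumerate(tasks):
        model.train()
        for _ in range(epochs_per_task):
            for batch in train_loader:   # the learner only ever sees the current task's data
                train_step(model, batch)
        acc.append([evaluate(model, test_loader) for _, test_loader in tasks])
    return acc
```

The point of the skeleton is just that the only thing the learner touches is the current task's loader; everything else (the accuracy matrix, the forgetting and transfer numbers) is measurement done from the outside.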
So one of these plots is supposed to show catastrophic forgetting: basically, as soon as you start learning task two, performance on task one goes down; as soon as you learn something else, you completely break the previous thing. The other plot here shows loss of plasticity. That is a second pathology you see in practice, where what happens is that after you learn task one, you're just unable to learn anything else. So usually a system falls into one of these two failure modes: either, after it has learned something, starting on something else makes it forget what it did before, or the opposite, it just doesn't learn anymore; it doesn't forget, but it also doesn't learn, which is another, degenerate way of "solving" the forgetting problem. So that's something you want to avoid too. But what people want even more is to maximise forward transfer. Forward transfer, as opposed to not forgetting, is basically the other direction. It says that if I learn a task, say, I don't know, you learn some sport, say football, and then you try to learn basketball, there are skills that can be shared between them. I know nothing about sports, so I have no idea whether that's a good example; one of you can correct me. But assuming there is shared structure, you would expect that when you learn the second sport, you learn it much faster. The idea is this compositional kind of aspect we were discussing earlier. So this is what people are actually after: they're not just interested in there being no catastrophic forgetting, in the system maintaining performance as it goes; they want to see this speed-up in learning, where the more you know, the faster you pick up new things, which is something most systems don't really do today. Nowadays companies do the LLM reasoning, right? It consumes a lot of compute. So was there any research on trying to reuse previous reasoning traces? So you mean, like... okay, so... Let's say I give a problem to ChatGPT and I switch on the reasoning mode. If I ask it a similar task next time, it will go through the same reasoning steps. It doesn't make use of the previous reasoning trace, right? So was there any research on making use of those reasoning traces? So, yeah, in some sense yes, in some sense no. Obviously nothing like that is deployed, at least not in those systems, and usually, depending on the terms of service, they're not supposed to collect the data from your interactions with the system; they probably do, but it depends. So I don't think there's any explicit machinery looking at this. But people have been playing with the question of: can I distill these reasoning traces back into the weights, so that I don't need to redo the reasoning? There are papers on this that I don't have at the top of my head. The outcome is that it helps, but it doesn't fully work. The reasoning process itself seems to be doing something different; you can't fully distill it into the weights. It feels like having the freedom of doing this search at inference is genuinely helpful.
And maybe when I read more about it I'll have more to say, but it's the same reason why, for AlphaGo, you can't just take the MCTS and say "I'm going to distill it into the policy directly." The reasoning is basically a form of search that you do at inference. And I'll give you my own personal take, this is not published work, just my preferred explanation: this reasoning at inference gives you access to functions that you do not have access to by learning, because learning pushes you towards functions that are smooth in a particular sense, while explicit search at inference lets you access things that are less smooth. I think there is a fundamental difference in the behaviour of the system. So the overall answer is: you can improve the system a lot by distilling reasoning traces into the model, but you won't achieve the same thing, and that's usually why it's not done. If they could, it would be cheaper for companies to drop the reasoning and just give you the non-reasoning system, because inference time is what they pay for; they would distill all this reasoning into the model if they could, but it doesn't give the same performance. Yes? I have a question regarding the difference between continual learning and meta-learning. Which initial weights do you update? Is it the original initial parameters, or the ones that were updated in the last step, on the last task? Because, I'm assuming, in continual learning I learn, I adapt, then those become the new initial parameters, and then I adapt on the next task; whereas in meta-learning I'm always updating with respect to the initial parameters... Yeah, so, okay. Continual learning is exactly what you said. Maybe the simplest way to put it: you're doing a sequence of trainings, you train, and you continue. In meta-learning, that's not the case. Meta-learning is more like multi-task learning: you are shown all the tasks at the same time, as in the multi-task scenario. The only difference is that your goal is not to learn all the tasks. You're shown all the tasks at the same time, but your goal is to learn an optimiser, or some other piece of the learning process, that will help you learn any of those tasks much faster. By being shown all of these tasks, you are trying to extract their common structure, the structure that will help you learn each of them faster. Just to give you an example of a particular family of methods: MAML, I don't know if you've heard of MAML; actually, I don't want to explain MAML in detail, let me use something simpler. You can imagine, for example, that you want to learn a preconditioner for your gradients. So you want to learn a preconditioning matrix, and you don't know what it should be, but you have a collection of tasks, and your question is: what preconditioner should I use so that I can learn any of these tasks faster? So what you do is you sample tasks, you compute gradients with respect to the preconditioner matrix, and you update the preconditioner.
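Here is a rough sketch of that preconditioner idea, in the spirit of what was just described. The bilevel pattern (an inner preconditioned gradient step, then an outer update of the preconditioner through that step) is standard, but the function names, the use of a diagonal preconditioner, and the single inner step are simplifications I'm making for illustration:

```python
import torch

def meta_train_preconditioner(init_params, sample_task, n_meta_steps=1000,
                              inner_lr=0.1, meta_lr=1e-2):
    """Meta-learn a diagonal preconditioner P for gradient steps.

    sample_task() is assumed to return a pair of closures
    (train_loss_fn, val_loss_fn), each mapping a list of parameter
    tensors to a scalar loss for one sampled task.
    """
    params0 = [p.detach().clone().requires_grad_(True) for p in init_params]
    # Log-parameterised so the preconditioner stays positive.
    log_P = [torch.zeros_like(p, requires_grad=True) for p in params0]
    meta_opt = torch.optim.Adam(log_P, lr=meta_lr)

    for _ in range(n_meta_steps):
        train_loss_fn, val_loss_fn = sample_task()

        # Inner step: one preconditioned gradient step on the task's training loss.
        grads = torch.autograd.grad(train_loss_fn(params0), params0)
        adapted = [p - inner_lr * lp.exp() * g
                   for p, lp, g in zip(params0, log_P, grads)]

        # Outer step: how well did that single step do on the task's validation loss?
        # The gradient flows into log_P through the adapted parameters.
        meta_opt.zero_grad()
        val_loss_fn(adapted).backward()
        meta_opt.step()

    return [lp.exp().detach() for lp in log_P]
```

MAML follows the same pattern, except that instead of meta-learning the preconditioner you meta-learn the initialisation itself, which is what the next part of the answer gets at.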
And this is a meta-learning approach, because what you're learning is, in some sense, the optimiser; specifically, you're just learning this preconditioning matrix, so it's a very restricted form of it. In MAML it's the same thing, except that instead of saying "I'm going to learn a preconditioner," you say "I'm going to learn an initialisation of the model such that any of these tasks becomes easy to learn from it." So that is the difference: in meta-learning, you're trying to learn something about the learning process itself. That's why people sometimes call it learning to learn, because what you're learning is a piece of the learning process. But while you're learning it, you have access to all of your tasks, and that's what makes it work. In continual learning, you do not care about that; I mean, you do, but you care more about arriving at a solution than about the learning algorithm itself, although, if you're after forward transfer, you could argue you're still trying to improve the learning process as you go forward in time. But the essence of continual learning is really the sequentiality of the tasks. You never have access to everything; you have to see things in sequence, and you cannot go back: if you've done something wrong, you can't go back and correct it. That's one of the differences. But yes, there is a point where everything becomes very, very similar. I can completely see where the confusion comes from, because there are people arguing that meta-learning is a form of continual learning, or that continual learning is a form of meta-learning, and so on. They can become arbitrarily close depending on how you formalise them and how you set them up. But technically they are separate: continual learning is really about not being IID, and meta-learning is about figuring out a way to speed up learning by looking at data from many tasks. Any other questions at this point? Why can't you have access to the previous tasks? Why is it like that? So, okay, first of all, people usually pick which of these desiderata they care about, and you can drop the ones you don't. But take the robot example given here. You have a robot that you send somewhere far away, say a robot on the Moon. Obviously, the robot cannot communicate everything back to us and go through a big offline training step; it has to learn on the fly while it's there, and it has whatever capacity it has, finite memory and so forth. So one thing it can't do is store everything it sees and then run IID learning on all of it. And it can't go back, because the world is irreversible: even if the robot wants to turn around, it's not going to be exactly the same observation, the same situation. So information comes in, and you can't store all of it, because you don't have the capacity. You need to learn continuously, without needing to go back and revisit the old tasks. That said, people usually do allow keeping some observations: you have a replay buffer of sorts that retains some information, but this replay buffer tends to be small. You can't just store everything; you can store some important observations, if that's somehow helpful, but not everything.
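A small replay buffer of the kind just mentioned is often maintained with reservoir sampling, which keeps a uniform random sample of the stream seen so far under a fixed memory budget. This is a generic sketch, not tied to any particular paper, and the class name is mine:

```python
import random

class ReservoirBuffer:
    """Fixed-size buffer holding a uniform random sample of everything seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0

    def add(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace an existing element with probability capacity / n_seen,
            # which keeps every item seen so far equally likely to be retained.
            j = random.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))
```

During continual training you would mix a few sampled old observations into each new batch. Note that the open question flagged in the lecture, which observations are actually worth keeping, is exactly what this uniform strategy ignores.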
And you don’t have to, yeah, approach this problem of like, which are the most, say, yeah, which are the most important observations that you need to store, which I think that you can throw away, how you decide is, these are all kind of in the space of the field, the other way that the people are doing. Yes So here lost. Yes. And then we have to add some terms, for example, for the 1st company to, like the term where we can minimise the access to this task, also observe to minimise the important general system. Yeah, so we the same Yeah, yeah. So we’re almost not talking about solution. but yeah, like this is the rather this is what you want there still to be able to do. How do you do this is sort of up to the upper one. But yeah, like for example, for the, for the, for the, one particular approach is you are an additional turn for the loss and the regularisation, that is that the benefit from forgetting. So you have turns to, but it doesn’t have to be ideas. So for example, another way you put approach is problem of forgetting, so I guess example forgetting that’s usually easier to understand is not the best part, but you say I learned that part. No, I pleased this ways. I add new weights and I just learn in the new weights. I mean, this is not a proper solution, but it will be technically a way of doing it, right? So you’re not forgetting just because you’re not allowed to price this space. And, but obviously, it just means that you have your car under the damage. Sometimes it’s those that can have point number two, there’s not, you know, an ever knowing amount of different size. But it is a, in theory, it is a potential way of doing it. And then there is a, you get no variety of proper aligners by saying, I learned. I compress, and then I add 5 metres and then I can still learning. And this way you found sort of how much of a number of characters grow because you have this compression that. so you know and and that that’ll be another. So I don’t know that kind of makes it clear. So yeah, most of the time this thing will turn out into additional terms as you are, but if you actually want one function, it has all. somehow to keep us things between. Because we. You can keep the weight if that contrast. Yeah, yeah. that’s what I said minimum. I guess you can have a small exles or you can have the weight. I mean, it’s just sort of up to the benchmark to describe how much you can keep. and but you need to keep starting, but otherwise it’d be impossible, right? I think you have some sense of… How do we define a lot of specificity? I’m inferically saying it’s performance and metres. Um, So, okay, so my system is a bit of attractive. So there is still a definition of plasticity, it really is about whether you can minimise the loss of the data, but that’s, I mean, okay, so there’s participating in your science and persistency in machine. I machine when we say, I have lots of acity. I really be looking whether the loss is going down because I’m trying to organise. There is a beautiful vari of plasticity which is when I learn on your task. My training error goes down, but my validation loss does not. And that didn’t happen as well. So there are 2 guys. You either you cannot optimise so my loss does not go down or and I talking with that but that’s going to have to learn out It depends. So, um, You don’t say you need to have like that to see that you’re going to have plasticity. It depends on the properties of the model. So usually, for example, you lose plasticity when the all bec real conditions. 
How do we define loss of plasticity — empirically, is it just performance and metrics? Okay, so the answer is a bit unsatisfying. There is a definition of plasticity: it is really about whether you can still minimise the loss on new data. And there is plasticity in the neuroscience sense and plasticity in the machine learning sense; in machine learning, when we say a model has lost plasticity, we are looking at whether the training loss still goes down when we try to optimise. There is also a variant where, when I learn a new task, my training error goes down but my validation loss does not — the model can still fit but can no longer generalise. So there are two failure modes: either you cannot optimise any more, so the loss does not go down, or you can fit but you have lost the ability to learn in a way that generalises. And whether you see it depends on the properties of the model and of the optimiser. Typically, for example, you lose plasticity when things become ill-conditioned: if you compute the Hessian and its spectrum has some very large and some very small eigenvalues, and you are using plain SGD, you know you are going to be in trouble; if you use a preconditioner that corrects for this disparity, you are not in trouble any more. So in fact, loss of plasticity depends on what you are using to learn: the same system might lose plasticity under SGD but stay completely plastic under natural gradient, because natural gradient corrects for the curvature, so it can still make the loss go down where SGD gets stuck. There are also more drastic variants. For example, one way to lose plasticity is with ReLU models where units die: say I learned, from the first task, very negative biases in some layer. Then no matter what input I provide, the pre-activation is negative, the output of the unit is zero, the gradient through it is exactly zero, and there is no signal about which way to move. So no matter how much I change the optimiser — natural gradient or whatever you want — that unit is not going to move, because the gradient is exactly zero. That is a pathological form; it will not happen exactly like that in practice, but milder versions of it do.

So, to check I understand: you can take the same architecture, the same data, and with some learning-friendly optimiser get a model with very high plasticity, while with another optimiser, same model and same data, you get one that loses plasticity? Yes, that is true — plasticity depends on the optimiser as well. If you look at the literature, maybe the thing that is left implicit is that everyone uses Adam. When people talk about plasticity, they do not qualify it with the optimiser, because the assumption is that everyone uses Adam, so the definition implicitly comes from that default. For example, if your Hessian is sufficiently ill-conditioned to make Adam struggle, people will say there is loss of plasticity, because nobody is going to run some fancy second-order optimiser; the default matters a lot. But it is true that just by changing the optimiser you change plasticity, and it is equally true that by changing the architecture you get very different behaviour. In a later slide I will show an experiment comparing a ResNet, which has skip connections, with a VGG architecture, which does not, and you will see completely different behaviour. So just by choosing the architecture you change all of these properties, even if you have not changed anything about how you learn.
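To illustrate the dead-unit example from a moment ago, here is a tiny self-contained check (all numbers invented): once a ReLU layer's biases have been pushed strongly negative, its output and its gradient are exactly zero for any input, so no gradient-following optimiser can revive it.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(10, 4)
with torch.no_grad():
    layer.bias.fill_(-100.0)          # pretend task 1 drove the biases very negative

x = torch.randn(256, 10)              # any reasonable input batch
out = torch.relu(layer(x))            # pre-activations are all << 0, so the output is all zeros
loss = out.sum()
loss.backward()

print(out.abs().max().item())                 # 0.0 -> every unit is dead
print(layer.weight.grad.abs().max().item())   # 0.0 -> exactly zero gradient, no signal to recover
```

The same check with biases near zero gives non-zero gradients, which is why this is specifically a state the model can learn its way into and then never learn its way out of.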
The fifth desideratum I wanted to mention is backward transfer; maybe this one is a little less obvious. This is something that happens in biological systems: the idea that you learn task one, then you learn task two, and because there is a relationship between them, by learning task two you actually become better at task one, even though you are not practising task one at all. For example — I don't know — surgeons who play video games apparently become better at certain surgical skills, right? Even though you are not doing surgery, there is a relationship between the two, so learning task two improves task one. This is something I have almost never seen in machine learning: it basically never happens that you learn task one, then learn task two, and suddenly you are better at task one without having touched it. But it is something people would like from these systems: the property of improving on things learned previously by learning something new. One caveat: if you want to see this kind of effect, the tasks have to be related. If the tasks are completely independent, there is nothing to transfer, so there is no concept of backward transfer. And this is actually a very good point in general: all of these effects depend on the data. If the tasks are orthogonal to each other, there is no point in expecting forward transfer either — there is nothing to transfer. This is another of those silent parts of the definitions that usually are not made explicit: the extent to which you can get any of these properties depends on the common structure between the tasks that the algorithm is supposed to exploit. And the reason this is hard is that, in general, we do not have very good mathematical tools to describe that structure. Let me give you an example. One particular benchmark that is quite popular — it is called CLEAR, if I remember the name right — is a sequence of image-classification tasks, similar to CIFAR or whatever, where you have a bunch of object categories: phones, laptops, buses, cars and so on. The point is that you have pictures from past decades up to the present day, and that is how the non-stationarity arises: you start with older pictures and, little by little, move to pictures from today. There is a natural drift; there is a transformation that buses, cars and phones have gone through over those years, in terms of how they look. But I do not know how to describe that transformation mathematically. And I would not know how to explain why, by learning about a phone from the 80s, I should expect a machine to be able to detect an iPhone from the 2020s, right? It is not clear to me. So what happens a lot is that we have these benchmarks that make intuitive sense, because they make sense to us as humans — though I do not even know whether a person who had never seen a modern phone would show this kind of transfer. The bottom line is that we usually lack the mathematical tools to describe this structure. So even when you build an algorithm, run it on a benchmark and observe forward transfer, you can say "it does forward transfer", but you are rarely able to say "this is the structure the algorithm is exploiting, that is why it transfers, and on a different benchmark with the same structure it will also work". Why is this important?
This is important because if you take very different benchmarks and run the same algorithms on all of them, you can get different rankings: this algorithm is the best on this benchmark but the worst on that one, because each is exploiting certain silent properties of the data, and when those properties are not there in the other benchmark, it stops working. Machine learning is a very empirical field. Basically what I am saying is: yes, the data matters a lot; it has a big impact on how the algorithm behaves, and in practice we do not have the tools to describe this precisely. We just have algorithms that tend to be reasonably robust across benchmarks, and benchmarks that are fairly aligned in the kind of structure they have — but only by construction, not because of anything formal.

Yes — if the data is already correlated, can we use this correlation across tasks as a form of prior? We definitely do use correlation as a starting point, but it only captures part of the information. If you want to be mathematical about it, correlation is only first-order, essentially linear information, and usually the structure that is shared between tasks is not linear in nature, so it is not going to be exposed by a single correlation measure. That is the issue with this whole field: it is all non-linear. If we were in a linear setting, then yes, we could measure all of these things. There is the online learning literature, which, again, is one of those fields where some people will argue that continual learning is just online learning; except that online learning is usually far more theoretical and mathematically precise, and it usually works with linear models. There they do have bounds on how much you transfer; they can measure all of this, because everything is linear and tractable, so you can work it out. Continual learning is almost like online learning with neural networks, and there almost nothing can be done mathematically. Obviously we can use correlation, and it is a good starting point — there are works that look at it — I am just saying it is not sufficient. If I go back to the example of phones through the years: pixel-level correlation does not explain how this image of a phone is related to that one; it is not a linear transformation, it is much deeper than that.

So those were the desiderata. There is a sixth one that you will rarely find in papers — a lot of people ignore it, but I think it is good for you to know it exists — which is the property that the system should be aware that the world does not wait for it. What happens in all of these settings — say robotics, or RL in general — is that you get an observation, your agent decides what its next action is, and the environment politely waits for it. That does not happen in the real world: in the real world everything keeps happening around you, whether you act or not.
And you just have to act, fast. But in most of the setups we use, the environment will wait for you. Why is that important? It is important because a lot of the solutions people propose make the update step more and more expensive over time: you would end up waiting longer and longer before taking an action, because you have to do all kinds of checks to make sure you are not messing up the previous tasks. And that makes the algorithm unrealistic, because you cannot become slower and slower as time goes by when you live in the world. You need to always react within some fixed compute budget, otherwise you fall behind. This is another desideratum that people usually do not keep in mind, because in most of the artificial environments we use, the environment simply waits for the agent to decide which action it wants to take, and only then does the next thing happen. That is really just one sentence on this slide.

The other thing — and maybe this is not super clear from the list we had before — is that some of these desiderata contradict each other in a fundamental way. For example: you cannot both never forget and have fixed capacity. Never forgetting means you keep accumulating information, and that does not fit in a fixed-capacity model; at some point you need to forget something, otherwise you run out of capacity. So you cannot ask for no forgetting and fixed capacity at the same time; you have to decide on a trade-off. And there are other pairs like this that are in tension.

When we discussed this earlier, it seemed like this is not really a matter of the practical capacity the models have — what is preventing a ResNet-18 from learning all of these tasks, given that it has plenty of capacity? Yes, you are right: this tension does not play out in practice when you train a ResNet-18 on, say, a series of CIFAR tasks, because there is more than enough capacity. I am just saying that at the conceptual level these two requirements do not fit together. In that scenario capacity is not the bottleneck, and most of the time it is not, so you can do something. But at least conceptually you have to accept that these two things cannot both hold forever; there has to be a trade-off at some point. So yes, it is a question of the specific setting, the specific architecture, whether you need to worry about this strong interaction between the different desiderata.

Okay, another question, about forgetting: what if the model holds information that is out of date and has already been superseded? Okay — I guess people would call this unlearning. Maybe the point you are trying to make is that sometimes some of the information the model holds is out of date, it is not useful any more, and you want to update or replace it, right? According to the definition here, that would still be a form of catastrophic forgetting, because you are removing something and replacing it with something else — but in this particular case it is something that you actually want.
That’s how Because it usually was, you know, just try is that there’s all learning. So the difference may be between unlearning and catastrophic forgetting is that usually when you have these kind of situations when you want to replace information that you have in a system, this is usually a targeting process. You know that this fact is out of date and I want to replace this. The way, I guess, when we talk about that, sorry, forgetting the way this is usually natural in practice, you don’t have any control. like as soon as you start learning something else. like for example one topic time, right? You add some data about the new tax, spect example the the capital of France is not Paris or you switch to London, right? Then you want to find with your model to know that capital of France is London. you might correct the previous knowledge that the capital artized, but you also play other things suddenly forget the capitals of all other countries and everything. So this is the kind of we forgetting. like you don’t have any control over what you’re doinging from now. Well, in our learning usually spend more as you are like surgically going there with a particular piece of information that you want. But yeah. you can me I would just ask a question about, like, for kids, I’m like, is there anywhere easy to have in randomly? Like, the model is focussing randomly, information animals, or there’s any way to decide which information you can wait before it, like, right? So yeah, so this just goes into the of this learning field it’s all about like how you forget specific things and then they like specialised unated into that trying to do that. This is like a new deal more or less. So I think I started hearing about I’m learning, I don’t know, maybe like 3 years ago. I don’t think it was a big field before that, I mean, 3 or 4 years if I’m generous. So it’s a really new appeal. But now, you know, there’s a workshop on learning and there’s like groups that are just doing on writing and there’s a lot of research that are doing this It’s actually a really big problem. I think particularly now in this world of our labs. It is very important to be able to take part of the system and be able to make it forget specific things. It it’s good for eliminating f behaviour, it’s good for security, it’s good practases programme. So there’s lots of applications of why you want to be able to un learn from a system, particular data points or particular behaviours and so forth. And that’s why that exploded as a field. There are some methods. I don’t think any of them work. So by far, nowadays, say because of GPRs, GPR that someone can go and ask, I want to be removed from this database. you use to trade am system that you have. If that usually happens and, you know, the company has to comply, usually what it does is to reference the system from stretch. because because the data is. because anything else doesn’t really different. anything else is you know. But this is a big topic, you know, people don’t want to be able to that. They want to be able to take the system and say, just align this one piece of information and give the rest as a reason. Yes. But there’s no reliability. We even know what good fees of information are available in our models, training, that is it. How can I be selected to the Muslim? He’s like a people problem. 
Yeah — for the GDPR kind of case, it is something like this: I do not know exactly how it works, I have never used the feature, but imagine I have my images on Facebook and I know Facebook uses images to train some model, and I say I do not want my images used. It is not that Facebook has to go and search for every image of me anywhere; that is not what they have to do. They just have to remove the data that, in their system, is linked to my account.

But what about the model that has already been trained on it — how do they deal with that? So it gets a bit more complicated, but broadly the same logic applies. For example, there was a big lawsuit from, I think, the New York Times against OpenAI. If it is established that OpenAI used New York Times articles to train ChatGPT, they can sue, and if they win the case, OpenAI basically needs to go back to whatever process collected the data and make sure it no longer scrapes New York Times articles. I do not know exactly how they prove this in court; I do not even know who has to prove what. I assume the New York Times has to show that OpenAI used their data, then the judge says you are not allowed to, and then OpenAI somehow has to demonstrate that they corrected this. They have to preserve trade secrets, so they cannot disclose everything they trained on, but they have to somehow show they no longer access New York Times content. I am sure there is some legal process for this; I do not know what it is, but that is usually how it goes. Maybe the more subtle point here is that, in most of these scenarios, they do not need to identify some abstract concept — like, for example, removing every picture of me, even ones taken by somebody else without me knowing they are in the data. That is not what they have to do, because that would be really hard: it would mean having a classifier that can identify me and removing everything it finds. They just have to remove the data I point to — the data I hold the copyright for, the data I say they are not allowed to use. And that does not solve the problem one hundred percent, because the model may still retain things; it just solves the legal issue. And that is partly why, if unlearning actually worked — say you want to delete a particular New York Times article, you take that article, you run your unlearning procedure on the model — companies would be very happy to use it, because it is much cheaper than retraining. But — and this is the weird part, the legal aspect; this came up in a discussion we had — there are only certain things that are accepted by a judge in court. If I come up with an algorithm, publish it at NeurIPS or wherever, and then use it in my production system to remove the data, that is not a valid legal proof. Just because I published a paper and a bunch of reviewers said it is a good paper does not prove that I actually managed to remove anything. You need to provide some kind of proof that it works.
And there are certain techniques that are accepted by the court, and "I ran an experiment and the paper says it works" is not one of them — or at least the lawyers would need to argue why that is sufficient evidence. These processes are very complicated, and as far as I understand, the people involved in deciding what counts as proof do not really understand machine learning. So usually it comes down to things like: can you show that you never used the data at all — do you have logs or records showing the data does not exist anywhere in your pipeline, that can be verified somehow. Those kinds of things work. Algorithmic unlearning, right now, is really just a research area. People say that maybe in the future it can be used for this, but it is not anywhere close to being a proper, certifiable product.

Okay. So, back to the desiderata — yes, I talked about this already: another issue is the tension between not forgetting and backward transfer. This is also mathematically awkward, because there is a difference between "I want to retain exactly what I had" and "I want backward transfer": backward transfer by definition means I am going to change what I had, because the whole point is that learning task two modifies, and improves, task one. So it does not sit well with "never change anything". There are a lot of details like this if you ever try to formalise it. In practice it is not that important, because when people look at this list, they look at the behaviour of the system and can judge: okay, this is doing roughly what I expected it to do. What I am trying to argue is that, as humans, it is easy to validate these things; we understand what we expect from the model. But if you try to formalise them into numbers — especially if you want a single number out of it — that is where things become complicated, because when you reduce each of these points to a scalar and then try to weight those scalars against each other, you lose a lot in the process. So why does this matter? It matters because, in practice, this is partially why the field moves the way it does. It is not like other areas, where you can reduce everything to a single number and at the end of your paper say "my method is the best because this number is lower than all the other numbers". In continual learning you usually end up with two or three numbers, and comparing methods becomes a judgement call: this one is better on this axis, worse on that one, slightly better on the other. In a way this is good, because people focus more on the idea you propose — is it interesting, does it work reasonably — and care less about squeezing the numbers, since nobody knows how to weight them anyway. And it is bad because, compared to other fields, it is really hard to know how much progress we have actually made.
If you look at the benchmarks, different methods are evaluated differently, you have all of these numbers, and you do not know which one to care about. So the result is that everything is a bit messier, a bit more all over the place. That is just the flavour of the field, I guess. This slide is not that important — it is more or less taken from the beginning of the lecture. I just wanted to point out that part of the continual learning problem is specific to parametric models like neural networks. If I have something like a nearest-neighbour classifier, where the data itself is the model, I do not have this issue of catastrophic forgetting: you have data, you get more data, it does not matter in which order the data arrives, and it is easy to remove data. So some of these unlearning issues, some of these forgetting issues, are specific to neural networks; for other kinds of systems you do not see them. And that is why — we were discussing this at some point — it is interesting to think about mixtures of these things. I guess this morning we were discussing transformers and how, if you put things in the context, you can also remove them from the context, and in that sense you get unlearning for free. The context of a transformer behaves like this nearest-neighbour part: if you reorder things in the context it does not matter much, the transformer just looks at whatever context it has at that point in time, and you are not going to forget the thing you just put there, because, well, it is in the context. So when we ask whether we want a purely parametric model, there are these different dimensions along which the systems behave quite differently, which is interesting, and it makes it a good idea to combine them when you can. And the transformer — maybe I will repeat this when I get to the slide on transformers — can be seen as the next building block for this: in some sense it already is a mixture of a parametric and a non-parametric component, because it has the weights and it has the context. So, just to say the same thing again.

Okay, so now — this is the slide before the break. I wanted to motivate how I think about continual learning; in particular, I gave one definition earlier, and I want to explain how I think catastrophic forgetting arises, at least for neural networks trained with gradient descent. A big part of the issue is gradient descent itself — the architecture matters too, but let us start there. Actually, let us take the break I usually have: let us take the five-minute break, and when you are back I will go through it. Where was I? So, I want to walk you through the mechanics of what actually happens when you do an update. As we discussed before, the way gradient descent works is that you compute the gradient on your data, or on a minibatch of your data, and then you subtract it from the weights. There are two steps here that are interesting for our purposes. The first is that the gradient you compute is a gradient over a minibatch — let us just assume you are doing full-batch gradient descent, right?
You’re comparing these variants, or you take grandpa in your data set, and then you’re an average data. That’s the one. the other that you are buying the you can look at what happens in the private. Sometimes the part is, you know, you have some down, you hydrate there. And then the variants will do one of 2 things. So either try to push the radi up so try to increase the magnitude, or it will try to increase the maximum. And this is something that happens for it each, but if you look at it, I need to respect a single weight. This is what each example in parallel is trying to do, right? Because it great the computer always starts in the matter. So you can think of like, okay, you have a bunch of examples that I’m trying to make the weight larger and some examples that are trying to make the weight smaller. So when you’re averaging, it’s almost like they’re playing this cargo working, right? Some examples are pushing to increase the weight, some examples are pushing to increase the weight. And it’s sort of crazy sort of, I, I always talk about dynamics. So you raise sort of this kind of push-up pull coming from different examples. And the way learning progresses is that you find some kind of ecliprium where the examples that are trying to push the weight up, and example, they’re trying to push the weight down, kind of get to sort of have the same power in how they do the push. right? And then this is sort of how learning happens. So this is a little bit small as a lot, but like this is learning past one, pasta, together. And if you look at different radiants, right, the blue, you see that they oppose each other. But like when you sound them together, they cancel each other and they tell them. So now to understand what is the continuing problem. Imagine that you remove one of these force, right? If you remove some of the examples, some of the examples are not in the data anymore, because, you know, that’s fast one as you don’t see anymore. What happens then is that you only get a 2 grade there that is pushing this way and you don’t have the right way that it’s pushing that way. So because of that, like this type of world game cannot exist anymore, you basically push the rope all the way to yourself and you forget anything that those examples are trying to involve in that particular way. So this is how it’s like a toy picture that is designed to describe the kind of dynamics that are going on, meaning in sort of where you’re doing radio stuff, right? So then you have to each example, trying to push the weight, you know, pull the weight in some direction. And usually you find some kind of a living area, but if some of that data is decent, you cannot find anything easier anymore. And then I need breaks. And this is basically the mechanism to defend the silence in the neural network. You do using examples. So my claim here for you is that this taco word dynamics, these activities being resolved among examples within a neural matters. And for this other part, for this type of organ to work, you need to have all the data present, or at least in expectation, all the data needs to be present. And that’s why theI assumption comes from. So once this idea assumption is not distributed, you basically brought the equilipium and you cannot find the same solution. And that’s where kazami forgetting happens. Because you can imagine this task palatas too, if one disappeared, the other task disposed of the ways on the way it wants to be. So this is like a metal future of how this happens. 
And maybe just to add to that, another side observation we can make is that there is no explicit composition of knowledge. When I learn class B, I do not learn B as a function of A; I am not building on what I already have. Instead there is tension: B is learned as everything that A is not. If you learn a set of classes this way, you learn them by what they are not, rather than by what they are, and the classes are all defined against each other: if it is not B, C, D, E or F, then it must be A, because that is the only option left. That is how it works, and the weights implementing all of these classes are living inside the same tug-of-war. And this is a problem — at least, my view, for what it is worth, is that this is a fundamental problem, and that it comes from gradient descent. I say that because we typically use gradient descent, but we could use other learning rules; so my conjecture would be that there is potentially another learning rule, another way of doing credit assignment, that would not suffer from this problem — but I do not have one to offer you.

There is another story here that I find quite interesting, and again it is very connected to this type of dynamics. What does this mean in practice? Here is an experiment — not a thought experiment, we actually ran it. You can ask what happens if you want to learn to play StarCraft, which is a game where you can play different races; in this particular case we are looking at Zerg versus Protoss. We looked at what happens when you try to learn to play with both races at the same time: you start training, learning with both races, and what happens is that these two tasks fight each other. If you look at the gradients — similar to the tug-of-war I just described — the gradients cancel each other out. This particular experiment was reinforcement learning, but the same thing happens in supervised learning, and I will show that in a moment. In RL we have this pathology that, at the beginning, the strength of your learning signal is weak, because the quality of the signal — the direction you have to move in — is given by the rewards you have seen; when you start learning in RL you do not see any reward, so you do not know which way to move, and your signal is weak. The signal becomes stronger and stronger the more you learn, and then, as you approach a solution and start converging, it becomes weak again, because the gradient starts vanishing. So the gradient is small at the beginning, larger in the middle, and small again at the end. Now, what you notice when you learn these two tasks together is that if, by chance, one of the tasks — say playing Protoss — is doing slightly better, then it has a slightly stronger gradient than the other task. And what that does is that you basically end up learning only that one task, because its gradient is always stronger, and it always wins against whatever the other task is trying to do. These plots are trying to show exactly that trajectory.
So, on this axis, this curve is learning to play as one race, and this one is learning to play task two. The observation is that, as these dynamics play out — it depends on the seed and so on — what usually happens is that you learn how to play one task, and only once you have learned it, and its gradient has vanished because you have converged, do you start learning to play the second task. So the point is that this type of dynamics, the same kind of thing behind catastrophic forgetting, has an impact outside of continual learning too; that is the story I wanted to tell. Here you are trying to learn the two tasks together, and because they interfere with each other, what you see in practice is that even though you are seeing data from both tasks, you first learn task one, and only after you have learned task one do you actually start making progress on task two. At the end you do know how to play both, which is good, but for us this was very surprising. The point, again, is that multi-task learning does not work the way people picture it. People do multi-task learning precisely so that they do not have to learn one task at a time: you give all the tasks to the model at once, and everything is IID. But even though you present everything jointly, if you look at the progress on the different tasks, they are actually learned sequentially. This is what happens in RL, and it happens in supervised learning as well. There is a very technical paper by Andrew Saxe — I will not go through it, and you do not have to read it — that looks at deep linear models and shows that, mathematically, this has to happen: he does some fairly fancy math, derives an ODE that describes how learning behaves, and it exhibits exactly this kind of pathology.

Yes — but isn't that also a weakness of continual learning? It shows that even when the tasks are presented to the model simultaneously, the model learns them sequentially; shouldn't the community then just present them sequentially? No — okay, maybe I have not finished my point. So yes, the situation is: you show everything IID, and the model learns the tasks sequentially anyway, so you might ask why not simply present them in sequence. The problem is that if you show the tasks one after the other, you do not end up knowing all of them, because you get catastrophic forgetting: in that protocol, if you do not do replay, by the end you only know the last task. Whereas if you learn them all together, you end up knowing how to do both — internally they get learned one after the other, which is the surprising part — but you still need to keep seeing all of them the whole time.

There is also a slide here that is less important, so I will not go through it, but the observation is that, as long as you have a deep network — this will be clearer if you read the Saxe paper — you get exactly the same dynamics in the supervised setting. So there is this discussion, and I have been having it for a long time, about why deep learning is considered data-inefficient.
Learning anything in a neural network requires a lot of data and a lot of compute, right? So the question is, why is this learning so inefficient? And what I would argue is that this interference is one reason for the inefficiency. Whatever you are learning, you can always decompose it into tasks — anything can be decomposed into sub-tasks, and those into further sub-tasks, and so on. And as long as any of these decompositions shows some interference between the things you are learning, it means you are going to learn them sequentially: you learn this part first, then that part. But even though they are effectively learned one after the other, you still have to show all of them at the same time, in an IID way. What that means is that you are going to waste a lot of compute and a lot of updates: even though you are not actually learning from, say, ninety percent of the data you are seeing at any given moment, you still have to see it. That is where the inefficiency comes from. You have to keep showing the data, because otherwise it disappears, the equilibrium no longer exists, and you start forgetting; but because you are not actually improving on those tasks right now, it feels like wasted compute and wasted data. What you would like is to be able to learn one task at a time — that is, more or less, how humans learn, right? A curriculum: you learn one thing, you build on it, then the next thing, and so on. Machines cannot do that, because of this interference, and that gives them this extreme inefficiency in how they consume data. This is not necessarily the only reason why learning is inefficient, but it is one reason: you have a lot of interference, which suppresses the learning of different aspects of your data, and you can only learn those aspects once you have converged on whatever you are learning now, so that the interference from it disappears.

Sorry — I do not understand much about the game, but are these two agents playing against each other, like one playing Zerg and the other playing Protoss? No, no — these are independent games. It is the same agent in both cases; the opponent is just whatever the game's built-in AI is. So they are separate instances of the game, and you simply combine the gradients into one update; the two are not playing against each other.

So there is one network, you compute the gradients from the two tasks and just add them together, so the gradient from one task cancels against the gradient from the other? Yeah — I would phrase it slightly differently, but you are basically right. So what happens is: you learn task one, and then its gradient disappears, right?
So then, when you take the next step on task two, that opposing force is not there, so you are going to move the weights towards task two, maybe more than you should. But on the next step, because you are still seeing data from task one, that data can show up again and pull you back, if what you just did started breaking what task one had learned. Whereas if you completely removed that data and did everything sequentially, then as soon as you start drifting away from task one's solution, there is no force to push you back, because you never revisit task one to notice: wait a minute, you are changing something you should not. When you do multi-task learning, even though it looks like you are no longer gaining anything from task one — you are sitting very close to its optimum — any small deviation produces a small gradient telling you that you are not supposed to touch those parameters. That is how the whole thing works, and I find it fascinating. That is the secret: when you are learning the second task, even though it looks like the only thing happening is performance going up on task two, the data from the previous task is quietly helping you not to break it, because every time you start to break task one, it gives you a signal that pulls you back. That is the machinery behind this.

And this has been observed at levels other than tasks. Here is a bunch of papers — I will not go into the details, but they are all about the same phenomenon. There is this line of research that looks at what happens to a particular data point during training: for example, they track whether the network correctly classifies a given image throughout training. And it turns out that the network learns it, forgets it, learns it again, forgets it again, and so on; sometimes you get up to twenty of these events for a single example — it gets learned and forgotten many times before it sticks. There seem to be easy examples that are learned at the beginning and stay learned, and really hard examples that are only learned at the end, but there is also this population of examples in the middle that the network keeps cycling through. And this is showing the same pathological effect, except now we are not even talking about tasks or continual learning: this is plain IID training on a single dataset, you are just tracking individual examples, and you see this cyclic behaviour. So the point is that this type of issue lives in the learning process itself, not only at the level of tasks; it is not something special about the continual learning setup — gradient descent itself has this pathology in how it behaves. So that is one example from this line of work.
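In the spirit of those papers, per-example "forgetting events" — transitions from correctly classified to misclassified between epochs — can be counted with a few lines of bookkeeping; the history here is faked, and in a real run you would record it once per epoch:

```python
import numpy as np

def count_forgetting_events(correct_history: np.ndarray) -> np.ndarray:
    """correct_history: bool array of shape (n_epochs, n_examples), where entry
    [e, i] says whether example i was classified correctly at the end of epoch e.
    Returns, for each example, how many times it went from correct to incorrect."""
    transitions = correct_history[:-1].astype(int) - correct_history[1:].astype(int)
    return (transitions == 1).sum(axis=0)   # 1 means: was correct, now is not

# During training you would fill this in once per epoch, e.g.:
#   correct_history[epoch] = (model(x_all).argmax(1) == y_all).cpu().numpy()
# Here we just fake a history to show the bookkeeping.
rng = np.random.default_rng(0)
n_epochs, n_examples = 50, 1000
fake_history = rng.random((n_epochs, n_examples)) < np.linspace(0.2, 0.95, n_epochs)[:, None]

events = count_forgetting_events(fake_history)
print("never forgotten:", int((events == 0).sum()),
      "| forgotten 5+ times:", int((events >= 5).sum()))
```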
There is other work in a similar spirit: it shows that you can throw away a big chunk of your data and get exactly the same result. For CIFAR-10, for example, you can throw away something like forty percent of the dataset and still get the same final performance, so forty percent of CIFAR-10 is in some sense redundant. The way you find which examples are redundant is actually quite expensive — you spend roughly as much compute identifying the data points to drop as you would have spent just training on them. But the other pathological part, the part that again says something about the learning dynamics, is this: I have reduced the dataset by almost half, and yet the number of updates I need to learn the task to the same level is the same as with the full dataset. Just because you threw that data away, learning does not become faster. I mention it because you might have this picture in mind that those data points are not useful, their gradients are essentially zero, so every time I see them I am just recomputing a zero gradient and wasting compute. But it turns out that even if you keep only the interesting data points, the ones that carry the information, the number of updates you need stays the same; it does not shrink, which I found very surprising. I cannot necessarily give you a good answer for why this happens — I bring it up because, to me, it is very surprising, and I keep coming back to it trying to figure it out.

So anyway, just to summarise the slide: catastrophic forgetting is all about these learning dynamics, and from this perspective you can reframe continual learning as the study of learning dynamics and credit assignment in neural networks, in particular under non-stationary data. And if you solve the catastrophic forgetting problem — which is not the full continual learning problem, just a part of it — it would mean replacing this part of the dynamics, because if you do not replace the tug-of-war there is no way to solve it; and if you do solve it, you might get very efficient learning in general. So this is a hope — it has not happened — but this is what people in continual learning would like to do: if they figure out another way of doing credit assignment, one that does not rely on this full-batch tug-of-war, they might get very efficient learning that applies everywhere.

Instead of having all the tasks fighting over the same parameters, can we make the neural network modular? Yeah, that is one direction people take, but there is also a danger there. The issue is that if you eliminate interference completely, you also lose something. Let me give you an extreme example. Say that, instead of what we normally do, I use a separate parameter for each example in the dataset: I have a shallow model with as many hidden units as I have data points, as many columns in my weight matrix as examples, so each data point gets its own slot. If you run gradient descent on this — or SGD, whatever you like — you will get to zero training error very, very fast; it trains extremely quickly. But what you are actually learning is to memorise each data point in its own column of the weight matrix, and you end up with a system that does not generalise at all.
Because that is basically a lookup table — I do not know if you have seen tabular approaches — you are essentially storing each example directly, so it would be like a nearest-neighbour or tabular approach to supervised learning. There is no interference, and there is also no generalisation.

So where is the critical point between sharing parameters and separating them? Yes — this example is probably beyond that point, to be honest. It is constructed so that the learning process can exactly store the dataset. But if you just compare the number of bits in the model with the number of bits in the dataset: it is not that the model has to be exactly the same size as the data, but once it gets close to that, conceptually, you are in that regime.

Yes, another question: how does catastrophic forgetting connect to class imbalance? Class imbalance really does have an effect. I have not focused on it much here, but if you have class imbalance it matters, particularly when you are doing multi-task learning: it affects which of the tasks gets learned first. The class or task with more data will dominate; it is going to be the one learned first. In this StarCraft example — learning to play Zerg and to play Protoss — the two are balanced, there is no imbalance, so what you see in practice when you run these things is that which task wins is basically chance: you look at the plots, and in many runs one race takes off first, in other runs the other one does. Depending on the random seed, on how you initialise, one task becomes dominant, you learn it first, and then you learn the other. If you have imbalance, you can actually predict which task is going to be learned first — that is where class imbalance comes in. And in the worst-case scenario, if the imbalance is really high, you might not learn the minority task at all. So yes, class imbalance feeds into this. And this is a very practical observation: when people were training systems to learn all the Atari games at the same time, they would see that the games are not learned at the same time. You have N tasks and one network, and you see the tasks learned in sequence — if they interfere with each other. That is not always the case; usually what you see is subgroups of tasks: some are learned first, then another group starts being learned, and so on — some kind of sequencing. But it is all data-dependent: it depends on how much interference you actually observe. If you compute the gradient for each task and take the cosine similarity between them, you can see how often the gradients conflict — how often that cosine is negative.
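A minimal sketch of that diagnostic — the cosine similarity between two tasks' gradients, where a negative value means an update that helps one task hurts the other; the model and task batches below are placeholders:

```python
import torch
import torch.nn.functional as F

def flat_grad(model, batch):
    """Gradient of the loss on one task's batch, flattened into a single vector."""
    x, y = batch
    model.zero_grad()
    F.cross_entropy(model(x), y).backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

def interference(model, batch_a, batch_b):
    """Cosine similarity between the two task gradients; negative = they conflict."""
    ga, gb = flat_grad(model, batch_a), flat_grad(model, batch_b)
    return F.cosine_similarity(ga, gb, dim=0).item()

# Placeholder model and data, just to show how the measurement is run.
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
task_a = (torch.randn(128, 32), torch.randint(0, 5, (128,)))    # classes 0-4
task_b = (torch.randn(128, 32), torch.randint(5, 10, (128,)))   # classes 5-9
print("gradient cosine:", round(interference(model, task_a, task_b), 3))
```

Tracking this quantity over training is one way to see when the interference between tasks appears and when it fades as one of them converges.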
So now I’ve got to jump into how people have been approaching these problems you kind of get a sense of the kind of solutions that are out there. Yeah. I was just thinking about, if the purpose is to make AI play, like 2 weeks. then we don’t remember everything we learn. No. So why should they be bothered about the improvement? So humans do have, um, especially like the whole, okay, so the whole kind of something thing being actually started from communation on neuroscience. It’s I should remember the name. I get them with them. It’s like a paper from the 90 vari of apparently. And you I was really trying to look at ads from back that and sort of human subjects of how they learn and how they forget. And people forget this book as well. But they do not forget catastrophically. So like, you know, the machines you forget immediately for humans is sort of slowly decaying over time and it takes you a certain amount of time to forget some pass run. I mean obviously there are factors in psychological factors another things that I maybe I get past or what not. but typically people that is sort a very nice third and how they forget that you do forget it just sort of sort of a like non- catastrophe. So that’s why the way kind of is there. Um, In the early stages of the here, I mean, I guess the main focus that you know are something forgetting because among the other teams is the one that’s easier to formalise. So if the, the, it reduces the, this is the fear of between variance. And because of that you can now start to be algorithm You can start looking the gradiients, you can look at the 1st time and then do things. But I think, you know, the community are sort of engineering sort of how people expect this to go. We do expect a decision for need to forget. So the goal is you want to figure out how you remember, how you force systems not forget, and you want to learn how to forget. And then you want to have full control over this. and build a system where you control of forgetting happens and then magic. I mean, that’s sort of the high level picture of what people are hoping, right? So there is an online community that is learning how to forget a specialised place where there is a community waiting a kind way for getting that is learning how to force system notify with anything. And the whole plan is that the subpoint distinction come together. And because you have full control of both behaviours, you can sort of decide what is the optimal curve, what is not afraid of, how much you control, I forget. Whether there is the right way of approaching a power from that, I don’t know. It could be that like you have to do both of them on the same time, otherwise nothing worse. there is another there another techqueirt’s just coming up in you’re trying to do something like people composly play, sleeping in sleeping off people there it reduces the fire you see know things someow. And then when you say that’s makes sense, or rather, it is more of a, because learning is, um, ability to, this needs to be ability to maintain some things like, and then the human brain can retain some things to some certain level. And then when we transfer them to the AI systems or computers and all that, like, the memory capacity, the computational capacity, because the brains have 1000000000s of mirrors and all that, but AI models, when we increase the neurons or something like that, computational complexity and all that, like, what would you say, easy? 
should we truly model the sleep mechanism from biology, or should we just increase the computational capacity? I mean, I think, okay, first of all, and this is maybe a pragmatic answer: it really depends on what you're trying to do. For example, if you have a way of compressing information, which is sort of what this sleep phase does (it controls the capacity in some sense and compresses information back while you "sleep"), then you get cheap inference. If you're trying to build a product and you need it to run on one GPU, you want that ability. If instead you're building, say, agents that help you do scientific discovery, then you can ask different questions: you might actually want that system to behave very differently from you, because maybe the limitations we have also limit us in how we search the space and how we do things. So you might say: it's going to be expensive, but if I can afford it and I have an engineering solution, I just throw more compute at it; I would like a system that doesn't have the same kind of blind spots I have when I'm trying to solve some problem. So I think it all depends on what you're trying to do. In principle there is nothing wrong with either: there are people trying to mimic this kind of behaviour and there are people trying to avoid this kind of limitation. And in principle, I think most of the time you do want it, particularly if you want a system that lives in the world; as I said before, the world doesn't wait for you. You have to give a response within some number of milliseconds, so it doesn't matter how many clusters you have: your latency has to be low, and the only way to do that is to keep the system reasonably small. There's no other way around it. Therefore you do need this kind of regular distillation or compression or whatnot. So in most cases you probably want that. There are probably some cases where you'd argue: I don't want any compression, I can afford to be as big as I want, and whatever benefit I get from not compressing is worth it. But generally you probably do want it. And a bunch of other observations point the same way, even the role that sleep plays. There is a lot of work by neuroscientists that lands on this topic of continual learning, and people working between machine learning and neuroscience bring in all kinds of ideas from neuroscience.

So, for example, a lot of the work that I'm going to describe here: the field is relatively young; as I said, it really got going around 2017 or 2018, so most of the papers are from then onwards. In machine learning terms that's ancient, but in reality it's fairly recent stuff. Anyway, most of the early works from that period, up to a couple or three years ago, tend to use this task-incremental setting, which some people call a piecewise-stationary distribution. What that means is that you have task one, and within it everything is IID, like a normal learning problem.
And then once you learn that task, you move to task two, which is a different data distribution but again IID within itself; you learn that task, you move to task three, and so forth, and you want to maintain performance as you move through the tasks. But every time you're learning one thing, that thing is IID. That's the difference. Earlier I was giving you the example of what happens if you see all the zeros, then all the ones: that is not IID. Any time you look at the data it's all the same digit, so it's not an IID sample from the overall distribution. Here, in task one you learn to classify ones and twos: every batch is randomly sampled ones and twos, so it's a binary classification task; then in the next task you classify threes and fours, everything shuffled between threes and fours, and so on. So this is the traditional setup.

Why is it used? A lot of people complain about it, and it is not a natural setup; it's not like the world behaves like this. It's chosen because it allows you to measure things and to control things. The protocol is very likeable because now you have a validation loss defined for every task; every task is a normal supervised task, so you have a validation loss, you have a test set, you can control exactly when the switch happens, all of this. It's really convenient; there's nothing more to it than that. As of a few years back we're moving into a space where the new benchmarks are more about a continuous drift of the data distribution and that kind of thing, so people are moving away from this framework. This is just where we start, because most of the algorithms I'm presenting are within this framework; many of them you can extend, it just takes a little bit of imagination, and I'm not going to do that here.

One particular big complaint (let me check the time so I'm not keeping you past the end; we have more time, good): one complaint people have with this setup is that it bakes in a strong prior. We have this discrete sequence of tasks, and you know when you are switching and which task you are switching to. Usually what happens is: you know you're now doing task one, you know that at step 10,000 you switch to task two, and you know what you're classifying. People say this is very limiting and very unnatural; in reality you're never told "you're doing this task now, and now you're doing that one", it's something that evolves naturally. What I want to say is that this business of task boundaries, the process of inferring which task you're in, is actually not that hard; there is an easy way of doing it, so it's not a real limitation of these algorithms. In particular, the typical way in continual learning, when people want to infer the task, is to train a generative model that tries to capture the distribution of the data seen so far. Whenever you get a new data point, you ask: how likely is this data point under the distribution I've seen so far? If it's likely, you say it's the same task; if it's not, you say it's a different task. So it's basically anomaly detection: whenever you see something that has changed drastically, you assume you've been moved into a new task.
And then you can use this to infer the current task. You can do this in different ways. You can have a generative model that gives you a density, so you can ask how likely your data point is under that distribution. Or you can look at the error: if you see a spike in the error, say your error had gone down to 0.1 and suddenly it becomes 5, you know something has happened, you assume there was a task switch you weren't told about, and you react to that. Or, and this is another perspective that I find cool, you can be agent-centric. Traditionally most approaches are environment-centric, but they can be agent-centric. The approach there is something like: you learn, and you keep track of how much your parameters have changed, and at some point you say, my parameters have changed so much, I want to protect this. I don't know if I'm switching task, but I've learned enough, so I'm going to consolidate what I've learned so far, and then continue learning. Here you don't depend on the environment; you don't care whether there is a change in the environment at all. You're just looking inward, trying to estimate how much you've learned, and you decide by yourself that you've learned sufficiently: okay, now it's time to stop and consolidate whatever I have so far, regardless of what the environment decides to throw at me next. So these are the typical ways of detecting the current task and detecting the switches.

Question: this generative model, does it guarantee a drop in catastrophic forgetting? No. The generative model is just there to tell you when things have changed; it does not by itself prevent forgetting. And it's not without problems. I think at some point we discussed how a density model trained on CIFAR can assign higher likelihood to SVHN than to CIFAR itself, or something like that. You're going to have those kinds of issues: even though the distribution has changed drastically, because your density model is off in some way, it will believe you're still in the same task. So it's not without problems, because the real issue is that learning the distribution of the data is not easy; density estimation is a hard problem. I can tell you what we used: this is the scheme we used with one of our algorithms when we were playing Atari. We used a Bernoulli model over the pixels, which is not a great generative model. The only reason it works, which is kind of funny, is that in Atari, if you take a whole game and just average the pixels, just look at the average image, you can almost tell which game it is from the average image alone. The games look so different that you can average everything and they still separate very well. So for Atari you don't really need much to figure out which game you're playing; they look so different that almost any simple classifier will tell you (a toy version of this kind of detector is sketched below). But in general this idea of using a generative model is a little bit tricky. One problem is that the density model can fail, because density estimation is a hard problem.
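To make the detection idea concrete, here is a toy sketch of a per-pixel Bernoulli detector. This is my own simplified illustration, not the exact scheme from the lecture; the threshold and the running averages are arbitrary choices.

```python
# Keep a running average image for the current task (a per-pixel Bernoulli model) and
# flag a task switch when new frames become very unlikely under it (anomaly detection).
import numpy as np

class BernoulliTaskDetector:
    def __init__(self, n_pixels, threshold=1.5, lr=0.01, eps=1e-3):
        self.p = np.full(n_pixels, 0.5)   # per-pixel Bernoulli means
        self.baseline = None              # running NLL on the current task
        self.threshold = threshold        # relative jump that counts as a switch
        self.lr, self.eps = lr, eps

    def nll(self, frame):
        p = np.clip(self.p, self.eps, 1 - self.eps)
        return -np.mean(frame * np.log(p) + (1 - frame) * np.log(1 - p))

    def observe(self, frame):
        score = self.nll(frame)
        switched = self.baseline is not None and score > self.threshold * self.baseline
        if switched:
            self.p[:] = 0.5               # reset the model for the new task
            self.baseline = None
        else:
            self.p += self.lr * (frame - self.p)                  # running average of pixels
            self.baseline = score if self.baseline is None else \
                0.99 * self.baseline + 0.01 * score               # running average of NLL
        return switched

# Usage: feed binarised frames; True means "assume a new task started here".
det = BernoulliTaskDetector(n_pixels=84 * 84)
frame = (np.random.rand(84 * 84) > 0.5).astype(float)
print(det.observe(frame))
```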
Another problem is the genrity model learns so slowly or learns too fast. And then again sort of mess it up and everything. So I’m just saying it’s a solution that we widely used and if you are learning to work with this in Europe, but it’s not necessary without all. It has lots of intel issues as well. There is there been traditional here. So as I said, in the last few where people have moved to this more like smooth transition and since the task I was telling you. So it’s not found, but they have computers, they have buses, they have cameras and they have different things. And these are supposed to be through certain times. This picture is cameras over the years, right? And I supposedly there is a change in the distribution, as the years go by, how left of my new, or how new, and so forth. I don’t really see it. maybe they don’t have a suffision large Maybe Windows exchange, then it is something like that. But anyway, the whole point here was that you take some consistent thing, like postplay and probably depending on what is in passion at that point in time, you start seeing different kinds of costumes over the years. And this creates some kind of non-stationarity, but it’s smooth. It’s not anything interesting. There’s no boundary where you can say, okay, no, this is where things have changed. So this is just for for, you know, to cover our bases. So this is typically usually called online contin learning or continual online learning. yeah multiple names, music paper. But most methods that we’re going to talk about that are in this district task thing, they can transfer to some alterations. The other difference in this space compared to your original space. So when you have this discrete task with boundaries, people usually worry a lot about philosophy forgetting. In the newly teacher, from a couple of years ago, when they study this problem, they usually do that forecast. So certainly that is the same issue of the computer for numbers. And then which number are you going to emphasise in your paper, which one are you going to be most proud of? And in your paper, everyone was talking about that. So you forget the number, no, everyone is talking about their 4 transfer number, and they’re kind of invited. That’s sort of the community 100 about. This is the last flightide before a job with the solutions? I will do this slide that maybe the 1st solution and then I’ll start. should be more enough. The other thing that I think is actually quite important. is that almost all the algorithms that exist out there, that they were philosophic forgetting, They make this multitask assumption. And this one is the past assumption is is implicit. no one talks about it we have a paper now that youre trying submit trying we is playing this assumption. And it’s also sometimes weird because it’s assumption is that the agent is much larger than the world. So the assumption here is that if I would have all the tasks at the same time and no one has money, then this would be off. I will be able to solve all of them. And then what I’m doing with material, I think is can I see the task part of the gather and get the same performance in the market. So this is a interesting assumption that all the mats are making. that this multitask solution is optimal and it’s the pastity group. And what we’re trying to do, whether you’re learning, is somehow for what to make this multicast solution. is you don’t understand why you 2nd going and point comes from the best world. 
the assumption is that if you had all the tasks together, you could solve all the tasks. You have the capacity: if the world is the sequence of tasks you're seeing, the fact that multitask learning works means you can fit all of them. So it's not a capacity issue; it's just a question of the order in which you learn things. There was a question about capacity, whether at some point you need some kind of reorganisation or compression. Yes; to be clear, I'm not saying that capacity is or isn't an issue. What I'm saying is that when these methods claim the optimal solution is the multitask solution, they are implicitly assuming it, because they're assuming the best thing you can do is imitate learning all the tasks at once. It's not even just that this is the baseline they compare to: what they actually do is take the multitask objective, loss of task one plus loss of task two, and try to approximate it. They're assuming this is the optimum, and by making that assumption they're implicitly assuming that you have the capacity to learn all of these tasks. You could argue that actually the best thing you can do is forget aggressively, because you have limited capacity, so that you can learn the new thing. But they're not trying to do that: they're explicitly trying to memorise everything they see, under the assumption that this is optimal, which to me says they're assuming we have enough capacity to store all of it.

Question: can the multitask solution actually be wrong? It's not wrong, and I'm almost getting to this. There are situations where you can do better than multitask. It depends on what you measure. But if you measure something like regret, how well am I doing at each point in time as I go through all of this, then sometimes I need to learn something that is not going to be useful again for a long time, and it is actually cheaper to forget it now and relearn it from scratch later when I need it. So it's not always clear that the best thing is to retain all the behaviours you've acquired all the way to the end. Sometimes, depending on how often a behaviour will be needed in the future, or how much it will interfere with whatever you have to learn next, the better thing is just to forget it, move on, and relearn it when needed. You can construct, fairly easily, corner-case task sequences with this kind of structure, and then you can show that multitask is actually really suboptimal, and that even a simple fine-tuning agent, if you look at the performance it gets, ends up better, because it doesn't have to deal with the interference. So there is a sense in which... okay, this will probably be the last slide. The example on the slide, with the target jumping from one side to the other, is this kind of thing: you have a task where the target is almost a square wave, so first you need to predict one value, then the other, then back again.
And when you add a mechanism that tries to help you not to forget, you basically end up with this blue curve that settles on some kind of mean between the two values, whereas you would be better off forgetting and relearning from scratch every time. It's not the best example, it's a toy example, but it's trying to illustrate this picture where the two tasks contradict each other. If, in the same state, one task says you should go left and the other says you should go right, and you try to keep both of them at the same time, then when you're in that state you go neither left nor right; you stay there, which is worse than either of the choices. So you're better off just committing to one choice, even if sometimes it's the wrong answer. That, I think, is the sense in which the multitask solution is not always optimal. But it is optimal in a lot of instances, and it is what most algorithms assume. This was just to show that everything has its corner cases, everything has its exceptions in practice.

Okay, so maybe let's start with the family of algorithms that is the most widely used; this is where we'll start. So this is a whole family of algorithms. What do they do? You have the objective with the losses for tasks A, B, C, but you no longer have access to the data for A and B. So what do you do? You take a second-order Taylor expansion of those task losses, and you keep the Hessians. The key point is that the Hessians act as a compressed form of all your data. Obviously the Hessian might have more entries than your data in some cases, but this is the prototypical approach: you have the objective you'd like to optimise, there are some terms you don't have access to, and you replace them by a regulariser, which is this quadratic. That's how you deal with not having the old data: within that objective, you throw away the terms that rely on data you no longer have and replace them by something else. And the choice of regulariser and how you weight it is what distinguishes all of these methods, elastic weight consolidation, synaptic intelligence, learning without forgetting, and so forth; they're just different approximations of those terms.

Question: is this all the same class of methods? Yes, this is one of them; the second-order expansion is one approach, and you can make other approximations too. Question: do we need to know which weights are important for each task? Yes, and this is what the Hessian gives you. The Hessian tells you: if I change this weight, how much will the loss for that task change? That's exactly what it encodes. If I perturb a particular parameter, does it matter? This is a specific sensitivity-analysis metric; that is the definition of what you're trying to measure here. You look at a particular weight and ask how much changing it affects the loss. If it affects the loss a lot, then you have a tight constraint: you say, okay, this weight has to stay roughly the same. If the loss does not change, then you say, okay, this is a weight I can reuse: you can change it however you want and it's going to be fine. So we make this approximation while changing the weights for the new task we're trying to learn, and we compute it at the switch point.
So we learn task one; after learning task one we stop, we do this sensitivity analysis to see which weights are important, and then we penalise changing those weights while we continue learning task two. Question: do you actually freeze them? I mean, obviously I'm not saying this is optimal, or that you should literally fix them hard; this is just the basic version of the idea. Question: in terms of regularising the weights in a continual learning problem, is there a way to rescale the weights, say keep them no larger than some bound, without affecting the performance on the previous environment? That's a very good point. Yes; well, it depends a lot on the architecture, but it turns out that, for example, if you take a ReLU network, just to make things easy, the norm of each layer doesn't actually matter. The only thing that matters is the direction in which the weights point; that is the only thing that actually carries information in terms of your representation, except for the top layer, which you cannot renormalise. So one thing you can do is renormalise the weights of every layer except the top one. The way you do it: with a ReLU network, if you multiply one layer by alpha and the layer above it by one over alpha, the function doesn't change. So you multiply the first layer by one over its norm and multiply the layer on top by that norm; the norm gets pushed upwards, layer by layer, until it reaches the top layer, and now all your weights below the top are normalised. This relies on the ReLU; it doesn't hold for arbitrary activation functions. And it doesn't affect the performance at all; it doesn't change the function, it's exactly the same function. So this is a symmetry that you can exploit without changing the function at all (there's a tiny check of this below). What people do in practice is something less drastic, which is to independently project each weight vector onto the unit sphere: you just remove the norm without correcting for it. That works fine; now you are changing the function a bit, but in practice it's fine. And this is related to practical tricks like weight standardisation and weight normalisation; it's the same kind of idea, and one of the students here worked on something along those lines.
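Here is a small check of that rescaling symmetry, a sketch in PyTorch assuming a two-layer ReLU network; the point is just that dividing one layer by a positive scalar and multiplying the next layer by the same scalar leaves the function unchanged.

```python
# Verify the ReLU scale symmetry: scale layer 1 by 1/alpha and layer 2 by alpha,
# and the network computes exactly the same function.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(5, 8)
y_before = net(x)

with torch.no_grad():
    w1, b1 = net[0].weight, net[0].bias
    alpha = w1.norm()        # use the layer norm as the rescaling factor
    w1 /= alpha
    b1 /= alpha              # the bias must be scaled too, since ReLU(z/alpha) = ReLU(z)/alpha
    net[2].weight *= alpha   # push the norm into the layer above

y_after = net(x)
print(torch.allclose(y_before, y_after, atol=1e-5))  # True: same function
```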
LECTURE 9:
So, at a high level, there is no fundamental difference. Online learning tends to rely on linear models and focuses on theory; it gives you regret bounds. Continual learning is, in a sense, the newer name that appeared around 2017, and it's really focused on neural networks and on the issues that come up when you try to learn in these non-stationary settings with deep networks, issues you will not find in exactly the same form in a linear model. So they study a different side of the same problem, but they are trying to answer the same question. Maybe another real difference is that in online learning the whole field is centred around the idea of regret. What they ask is: if you're learning on this stream, and you accumulate the errors you make while you're learning, how does that compare to some baseline, say the best fixed predictor in hindsight? That difference is your regret: how much worse you are doing compared to this baseline that is trying to learn the same thing. The exact definition of regret isn't that important here (I'll put the standard formula below), but the point is that the field essentially focuses on forward transfer. In online learning you will not see people asking questions about forgetting; the only thing they focus on is regret, and regret mostly measures whether there is any kind of positive transfer, because the baseline is usually something simple that doesn't do anything clever, and what you want is to learn faster than that baseline. That's the online learning side. Otherwise, in terms of problem settings and so forth, there is a lot of common ground, and nowadays there is an effort by different people to try to link these two fields together more. For example, and I forget the name of the algorithm, but András was claiming that for EWC there is actually an algorithm in online learning that looks almost the same. So they assume similar setups and they have analogous algorithms; continual learning has things like replay, and online learning has its own counterparts. There's a lot of common ground; it's just that they are completely different communities, they typically don't talk to each other, and they have very different publishing cultures. Continual learning, as an empirical subfield, is very empirical: if you look at the papers there are usually pretty drawings like this one, some diagrams, a little bit of math, and then numbers. An online learning paper is just math: they really are just trying to prove things, they don't care about running experiments, they're trying to show that they can reduce some constant in some bound. And when it comes to practice, people usually don't like that kind of theoretical work, because the bounds tend not to be informative: one algorithm can have a better bound than another, but if you run them in practice it just doesn't work out that way, because the bounds are so loose, so far from what you actually observe, that improving a constant by some epsilon doesn't mean anything practical. So the continual learning part of the field does this kind of empirical work, which usually carries a lot more practical weight.
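For reference, the usual way regret is written (my notation; the standard online-learning definition, which the lecture only describes in words):

\[
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(\theta_t) \;-\; \min_{\theta} \sum_{t=1}^{T} \ell_t(\theta),
\]

where \(\theta_t\) are the parameters you actually used at step \(t\), \(\ell_t\) is the loss you incurred at that step, and the minimum is over a single fixed comparator. Low regret means that, averaged over the stream, you were learning almost as well as the best fixed baseline in hindsight; nothing in this quantity asks whether you later forgot anything.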
I don't know if that helps, but yeah, that would be my differentiation between the two fields. Any other questions? Okay, then I'll go to the slides; I'll start with a recap from yesterday and then move to the current slides. So, the recap. We started with the basics of deep learning, went through some ideas on optimisation, looked a little bit at gradient descent and SGD and things like that, and then at generalisation versus optimisation, where we looked at things like double descent, the implicit regularisation of SGD, and these kinds of newer results. And now we're in this module where we're looking at continual learning. Here the idea is that we're not in an IID setting anymore: the data is not sampled from a fixed distribution, and there might not even be a distribution. You have a stream of data that keeps changing over time, and you ask: can I still learn in a setting like this? In practice, in order to formalise the problem and try to understand what's going on, we use this piecewise-stationary kind of setup. What that means is that we have a sequence of tasks, not a stream but a sequence of tasks, where each task is well defined: within each task everything is IID, you have a training set and a validation set and everything, but you have to learn these tasks one after the other, essentially by fine-tuning. This is just a construct that helps you figure out what's going on and lets you measure things; it's not that the field believes this is literally the problem we want to solve, it's a formalism that helps you make progress.

And then there are multiple properties you might want in a setting like this. One is that you shouldn't see catastrophic forgetting: as you learn one task after another, forgetting shouldn't make you completely lose everything you learned before; ideally you want to maintain performance on the tasks you've already learned as you go through the sequence. Ideally you would also like to start learning new tasks faster and faster: you would like some kind of forward transfer, because the underlying assumption in all of this literature is that there is some shared structure between the tasks; the tasks are similar to each other in some way, and you should be able to exploit that. So as you go from one task to another, you want to learn faster and faster by using that structure. And then there are fancier things like backward transfer and so forth, which usually never happen in practice, but they're always nice to have: backward transfer means that after you learn a new task, your performance on previous tasks improves, because of the shared structure. And then there's a bunch of constraints and different formulations of the problem; even from paper to paper they consider different constraints. These range from: you're not allowed to see any of the previous data, or you can only store a little bit of it; or you have a fixed budget of gradient steps, so you limit the amount of compute; or you fix the size of the model, which is not allowed to grow; and so forth. These are constraints, and usually each algorithm considers some subset of them and ignores some others, and depending on your application you might care about one or the other.
So, for example, and I actually don't even know what the right example is, but if you're working with something like a large language model, maybe you don't want to store a lot of data, because preserving the previous distributions that way might become too expensive too quickly; you would need quite large volumes of data. So maybe you would prefer some kind of restriction that acts locally on the parameters: you don't want to store data, and you might not even want the extra computation. There are all these kinds of restrictions that come from the size of the model, from the kind of data that you have, and so forth. In some instances it's totally fine to just store everything you've seen and then have some search-and-retrieval procedure to decide what to look at again. In other instances there are hard constraints, say a robot in the wild: you have finite memory, finite compute, and latency requirements you have to satisfy.

So, in terms of algorithms that exist out there for continual learning, what I'm going to try to do is classify them into a few families, because a lot of these algorithms basically share the same idea. The algorithms I'm presenting right now are mostly focused on solving catastrophic forgetting. As I said before, continual learning is a young field, it's evolving, and it's actually growing surprisingly fast. It's one of the new cool things to look at; a lot of the new startups coming up now are built around adaptability or continual learning. That's kind of the new trend: everyone is saying, oh, pretraining on its own isn't enough anymore, we have to look at adaptability, and continual learning is basically the academic field that looks at that. Sometimes these startups show up with big claims that they're doing something completely different, but people have been working on this for years, there's some theory behind it and there are algorithms; this is the field that covers it. But it is a field that's evolving. It started by focusing on catastrophic forgetting because that's the property that's easier to formalise, partially; and usually you start where you can make progress and do what you can. And just to add a bit more context around that: if you look at the early days, and I'm an author on some of those early-days papers so I'm guilty of this too, the framing we used to have was: we have this list of desiderata, we want to prevent catastrophic forgetting, we want forward transfer and so forth, so let's start with catastrophic forgetting, fix it, then move to the next thing and compose them together. I think more and more people are now realising that you can't actually do that, because there are a lot of interactions between these different desiderata, and some of them fight each other mechanistically. You can't just say: I'll look at catastrophic forgetting, fix that, and then add something else on top. You need to consider all the properties you want your system to have at once, because there is so much interaction between them. This is just my own feeling of where the field is going and how people are now thinking about these things.
And we're going to see some examples of this. I don't think you can treat each of these desiderata in isolation and then just compose them; you need to think about it more holistically. But, that aside, catastrophic forgetting is where most of the progress has been made in the field, so that's what we're going to focus on in this class, so you have an entry point; and then, if you ever want to jump into this space and do research on the topic, you can look at the newer stuff, forward transfer and the rest of it.

So, usually, and I think this is a classification I quite like, people split the algorithms we have for catastrophic forgetting into three classes. The first class is regularisation-based approaches. Here the idea is: go back to the tug-of-war picture of the learning dynamics that I was trying to convince you of earlier. This is probably not the textbook version of how people describe catastrophic forgetting, but I feel this picture is a very intuitive way to understand what's going on, and I think it will save you a lot of time. In that tug-of-war picture you have two teams pulling on the rope, and what continual learning means is that you can't have both teams present at the same time. So what happens is obvious: you have a single team pulling on the rope, there's nothing on the other side, and that team just pulls the rope all the way over. The idea of regularisation-based methods is that you add a regulariser that plays the role of the missing team, to try to recreate the ideal dynamics that would have happened if both were there. What does that mean mathematically? You write down the multitask objective, the thing you believe is optimal or ideal, which is not just solving the current task but solving all the tasks at the same time. If you have three tasks, you just write the sum of the three losses, assuming you had access to all past and future data. That would be optimal; now you look at it and say: well, there are some terms here that I don't have, so I need to approximate them. And then you come up with different ways of approximating them. One simple way, for example: you take the Taylor expansion of the terms for the tasks you've already seen, and you approximate each of them with a quadratic. And there is something nice about approximating the loss with a quadratic; okay, maybe a couple of details so you see why this math doesn't get even uglier than it looks. It might already look ugly to you, but it could have been worse. Because you first learned task K and you converged on task K, when you take the Taylor expansion, the gradient at that solution is (approximately) zero, because you're at a minimum. Because of that, all the terms disappear except the second-order one. Well, the zeroth-order term doesn't disappear, but you don't care about it, because it's not a function of the variables you care about; and the first-order term vanishes because the gradient there is zero. I'll write the expansion out right below.
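Written out, the approximation being described is the standard second-order expansion around the task-k solution (my notation):

\[
\mathcal{L}_k(\theta) \;\approx\; \mathcal{L}_k(\theta^*_k)
\;+\; (\theta - \theta^*_k)^\top \nabla_\theta \mathcal{L}_k(\theta^*_k)
\;+\; \tfrac{1}{2}\,(\theta - \theta^*_k)^\top H_k\,(\theta - \theta^*_k),
\qquad \nabla_\theta \mathcal{L}_k(\theta^*_k) \approx 0 .
\]

Since training converged on task k, the gradient term drops; the constant term does not depend on \(\theta\), so the only piece you actually have to carry forward is the quadratic term with the Hessian \(H_k\) and the anchor \(\theta^*_k\).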
So what you do is you take the Taylor expansion of the loss around theta-A, the solution you had before, and, because it is a solution, you end up with just the quadratic term; and you keep one such term for every task you've seen so far. The nice thing about doing this expansion is that the Hessian you get basically contains the data: the Hessian here is computed over the data of that task, so you can think of it as a compressed form of your data; it's the result of having incorporated the data. So now you don't need the data from the previous task: you only need the Hessian you computed and the theta-A, and those become the quantities you carry around; they effectively replace storing the data. Obviously even that is a lot, since the Hessian is quadratic in the number of parameters, so the next approximation is that you make this matrix diagonal: you throw away everything that's off the diagonal. Another approximation people make is that, instead of computing the Hessian, you use the squared gradients: basically all you keep is the expected value of the squared gradients over the data. So the very cheap version of this, which is essentially what EWC does, is: you finish learning task one, you compute these statistics, and you take the solution you have at that point; those are the two things you store. And then the penalty term has this form: theta minus the solution you had, weighted by, basically, the diagonal Hessian, or the expected squared gradients; depending on which approximation you make, they're roughly equivalent (there's a small code sketch of this below).

You can think intuitively about what a term like this does. It creates a force that pushes you back towards theta-A, the previous solution: you're learning the new task, so you're moving away from that solution, and this term pulls you back. That's why it's called elastic weight consolidation, and by the way it's inspired by synaptic consolidation in neurons; it's meant to be a fairly biologically plausible mechanism. What you're doing is adding a force that pulls you back, and the strength of that force is given by how important the weight was for the task you've already learned. That's the intuition. You can think of it as a sensitivity analysis that tells you: if I modify this particular weight, and with the diagonal we really do look at weights one by one, how much does my loss on task one change? If it changes a lot, then there is a very strong force pulling you back on that particular parameter: a steep quadratic; as soon as you move away, it pulls you back. And if changing that parameter does not affect the loss on task one, then the force on that weight is very weak, so you can move it quite a lot. So that's the idea at a high level. Now, there are many more algorithms based on this; we're not going to go through them all, but you can see it, right? You have a lot of freedom in how you define these things.
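A compact sketch of what that looks like in code, as my own illustration of an EWC-style penalty rather than any paper's reference implementation; it assumes a small PyTorch classifier, and the diagonal "Fisher" is just the averaged squared gradients on the old task's data.

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, loader):
    # Average of squared gradients over the old task's data: the cheap stand-in
    # for the diagonal of the Hessian described above.
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in loader:
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / n_batches for n, f in fisher.items()}

# After finishing task A, store the anchor weights and the Fisher diagonal:
# theta_A = {n: p.detach().clone() for n, p in model.named_parameters()}
# fisher_A = diagonal_fisher(model, task_A_loader)

def ewc_penalty(model, theta_A, fisher_A, lam=1.0):
    # Quadratic force pulling each weight back to its task-A value, scaled by
    # how sensitive the task-A loss was to that weight.
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (fisher_A[n] * (p - theta_A[n]) ** 2).sum()
    return 0.5 * lam * loss

# While training task B:
# total_loss = F.cross_entropy(model(x_b), y_b) + ewc_penalty(model, theta_A, fisher_A)
```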
Another way to arrive at this is from a probabilistic point of view, and that gives you things like VCL and so on, where, just to give you an idea of what happens there, this regularisation term becomes your prior. The way they frame it is: you have a prior over theta, and this prior changes every time you see a task, so you end up with a data-aware prior that basically tells you what theta should look like given the previous tasks. And then, because you've moved into probabilistic space, some things change slightly and you get a slightly different formula; it doesn't have quite the same flavour of "there is a force pulling you back and a coefficient that tells you how strong it is", but when you apply it to deep neural networks, all of these approaches end up doing something like this: they derive it differently, but it comes out similar. There are even algorithms that keep a richer matrix, something that captures correlations between weights, but most people make it diagonal. Another way to look at this, which might be useful: the analogy with L2. If you use L2 regularisation for a neural network, the penalty is basically (theta minus 0) times the identity times (theta minus 0): the matrix is the identity, because the strength is the same for every parameter, and the anchor is zero. So L2 is a force pulling every weight towards zero, equally. What these methods do is the same kind of thing, but instead of pulling you towards zero, they pull you towards the previous solution, and the strength of the regulariser differs from parameter to parameter, depending on how important each parameter is. So that's the regularisation-based family of solutions. As I said, it's a big family of algorithms, there's a lot going on, but they all sit under this rough umbrella.

What people usually say about these methods is that they're not the best-performing ones, even though they're elegant. The usual reason given is that they make this quadratic assumption: the whole thing works as long as, when you learn your third or your twentieth task, you can find a solution for the new task that lies within the region where the second-order approximation of your previous tasks' losses still holds, because that's the approximation you've made. And usually people say this is very limiting: you've done a Taylor expansion, you have this little area around where you are in which the approximation holds, and when you learn something new you have to stay within this trust region. You're not allowed to leave it, because if you do, none of the approximations hold anymore and everything breaks down. This is usually presented as a big weakness of these methods, that they can only account for this local, second-order information. We'll see later that in practice it doesn't seem to be that problematic; but intuitively, that's what's going on here. Question: this is not a capacity problem, right? It's an optimisation problem; the model itself is capable of representing both? Yes, yes, exactly. This is not a capacity effect; this is a learning problem.
And this, and I'm not sure if you were referring to my last remark about the quadratic approximation, but that is also not a capacity thing; it's really a learning thing. I guess what people imagine is: okay, you're in this part of the space, and there's some other part of the space where you would have a good solution for both tasks, and you can't get from here to there because you'd have to travel too far. That's what people are implicitly assuming. I think what we will see is that, in fact, this just doesn't really happen. Once you start digging into continual learning questions, you'll see that most of what they're addressing are learning problems, learning dynamics problems. And that's why I said I don't like the "world is bigger than the agent" hypothesis that much: to me it's never really a capacity issue that you're trying to solve; you're always solving a learning problem. I understand why, mathematically, "the world is bigger than the agent" is convenient; it's something you can do math on and prove things about. But it doesn't seem to be what actually bites in practice. In practice, you do something like this that only fixes the learning dynamics, and it solves the problem, which kind of tells you there was never a capacity problem there; there was a learning dynamics problem, and you fixed it. So that's maybe the substance of my criticism of that perspective. But "the world is bigger than the agent" is the easiest to formalise; I'm being a bit unfair to it, and people keep it around so that they can actually prove things, compared to the other framings.

Okay, so this slide says exactly what I was just saying: this works well when the tasks are similar, because you have to make these quadratic approximations; if one task pushes you very far away, everything breaks. Maybe this is not intuitive, but this family of methods turns out to be, most of the time, the most elegant and the cheapest. Everything else will make your training cost increase quite a bit, maybe linearly in the number of tasks. This one is quite cheap, because the penalty is basically just another L2-style term, which costs next to nothing compared to the forward and backward pass through your model. And you can actually start folding these quadratics together, so you end up keeping only a single quadratic term. I'm not going to walk through the technique, but there is a variant of this algorithm where you only ever have one quadratic term in your loss, no matter how many tasks you've seen. So this makes this family of methods probably the most elegant one. It's also the most biologically plausible, because this is a local objective: if you look closely, this is an objective that can be evaluated locally, per weight. So in some sense it's very elegant, but people generally assume it's the family that works the worst, because of the Taylor approximation it makes. Sorry, there are a lot of references here, lots of papers; you don't have to look at all of them. The second family is what people would call memory-based solutions, or replay solutions. And this is directly related to the replay you've already seen in RL.
Someone asked whether this is the same as the replay buffer in DQN. Yes, DQN has exactly this problem, so maybe I wasn't explicit about this. There was a point where you had the continual learning community and the continual reinforcement learning community, and the reason for that split, which is now slowly going away, is that reinforcement learning, by construction, is a non-stationary problem. RL, just by being RL, is already a continual learning problem. That's because the data you see comes from the policy that you have: as training progresses, the data changes, because now you have a different policy, you're behaving better, so you start seeing new data. So RL is fundamentally a non-stationary problem; it is a continual learning problem. So when Volodymyr Mnih and the others were trying to solve Atari with neural networks, the DQN work, to take the most famous example, these are exactly the problems they were having: you try to train your MLP or your small convnet or whatever you have, and because the data distribution changes over time, it doesn't converge; you can't train the network. Their solution was the replay buffer. And the replay buffer does exactly this: for the neural network update, it makes the data look IID again. So the replay buffer in general is a continual learning solution, and then there is a whole list of fancier methods which are variations on the same idea, some of them quite recent papers. But the idea of replay itself is pretty old. It is also somewhat biologically plausible: the hippocampus, if I'm not mistaken, stores episodes and then replays them, either while you sleep or in various other scenarios; so there's this idea that replay is actually important for biological systems as well.

And here, replay is also what's being used. Sorry, I don't have it on this slide, let me go here. But the idea is almost exactly the same. You have these task losses that you don't have access to anymore: you had some data for task A, some data for task B. And you replace the loss term, which was over the entire data set, with the loss computed on the few data points you kept. You're basically doing the same thing as before: you subsample the data you had for the previous task into a small buffer, your replay buffer, and you use it as a proxy for the full loss. It's the same machinery at the end of the day. All of these methods, and this is what I was trying to say earlier with the point about the multitask objective, all of them start from that objective and then make some approximation. There's a minimal sketch of the replay version below. Question: so in memory-based methods, the effectiveness depends on the buffer size? Yes, exactly. If, for example, one class is missing from the buffer, then it can break down, because the loss no longer matches. So you need to collect representative data points, and that's why I was mentioning earlier the work on compressing data sets, showing that you can throw away 40% of CIFAR and so forth.
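A minimal sketch of the replay recipe just described; this is my own illustration with illustrative names, keeping a small reservoir buffer of old-task examples and mixing a few of them into every update on the new task.

```python
import random
import torch
import torch.nn.functional as F

class ReplayBuffer:
    def __init__(self, capacity=500):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, x, y):
        # Reservoir sampling: every example seen so far has equal probability of
        # ending up in the buffer, without storing the whole stream.
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = (x, y)

    def sample(self, k):
        batch = random.sample(self.data, min(k, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)

def replay_step(model, optimizer, buffer, x_b, y_b, replay_k=32):
    # One update on task B: current-task loss plus the loss on a few replayed
    # task-A examples (assumes the buffer was filled while learning task A).
    x_old, y_old = buffer.sample(replay_k)
    loss = F.cross_entropy(model(x_b), y_b) + F.cross_entropy(model(x_old), y_old)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```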
That connects to this line of work where people try to understand whether some data points are more important than others, whether some points tell you more about the overall problem. So there is a whole literature in this space on exactly these questions: what do you do when you have a fixed-size buffer, how do you populate it, what do you put in, when do you throw data away, and so forth. There is another family of methods, and I was debating whether to put it here or not; I put it here at the bottom. They use a generative model. The idea is: instead of keeping a buffer of data, you fit a generative model to the data and later you just sample from that model. Supposedly this is again more biologically plausible, and so on. The reason I'm less excited about the papers coming out in this direction is that the generative model usually costs you more than the buffer in terms of what you have to store; it doesn't always work as well, and it's much easier to fill a buffer than to train a generative model; and the generative model also takes a long time to train. You learn a task, then you have to train a generative model on the data from that task, which takes almost as long as learning the task itself, and only then can you move on to the next task. So it feels wasteful. But there is actually quite a bit of work on it, and you can use the generative model for other things as well; once you have a good generative model of your tasks, there are other useful things you can do with it. So maybe overall it's worth it.

I put a slide here because this is a quirky method, but I thought I'd mention it in case any of you find it interesting. When people started working on these things, at some point we ended up working on this method that we called memory-based parameter adaptation, MbPA. The idea came from this notion in neuroscience of complementary learning systems: the brain basically has complementary mechanisms for learning that are good at different things. And, as we've discussed before, we have the same split in machine learning: you have things like nearest-neighbour methods, which are good for fast learning from a few examples, and then you have things like neural networks, which learn slowly but generalise well. This MbPA method tries to marry the two, and the way it works is, I think, kind of interesting. You have a network that you train normally. But then, every time you need to do inference, you also have a replay buffer where you've stored the data you've seen. You take the data point you're trying to predict on, you go to the replay buffer, and in this case the buffer is huge, it's basically all the data, and you search for the K nearest neighbours in that big buffer. You retrieve the K neighbours, you fine-tune your model on those K neighbours, typically just the top layers, you're not redoing the whole training, you make your prediction with that adapted model, and then you throw away whatever you've just learned. So here the idea really takes the "world is bigger than the agent" intuition to heart; there's a small sketch of the whole loop below.
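A rough sketch of that loop, as my own simplification of the MbPA-style idea; as far as I remember, the real method retrieves neighbours in an embedding space and adapts only part of the network, whereas here everything is done in input space and a full throwaway copy is adapted, just to keep it short.

```python
import copy
import torch
import torch.nn.functional as F

def adapt_and_predict(model, memory_x, memory_y, query, k=16, steps=5, lr=1e-2):
    # 1. Retrieve the K nearest neighbours of the query point in the stored memory.
    dists = torch.cdist(query.unsqueeze(0), memory_x).squeeze(0)
    idx = dists.topk(k, largest=False).indices
    x_nb, y_nb = memory_x[idx], memory_y[idx]

    # 2. Fine-tune a temporary copy of the model on those neighbours only.
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(local(x_nb), y_nb).backward()
        opt.step()

    # 3. Predict with the locally adapted model, then throw it away.
    with torch.no_grad():
        pred = local(query.unsqueeze(0)).argmax(dim=-1)
    return pred
```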
It's saying: any time, before making a prediction, let me first adapt my model on the data that is very close to me, so that I model the local structure of the world around me really well; I make my prediction; and keeping that adapted model around is not useful, because things outside this local region are not what I need right now. So I find that idea kind of neat; I thought I'd mention it.

Okay, and this is another paper I wanted to mention. Another thing you can do: instead of picking data points from your training set to put into your replay buffer, you can learn the data points. The reason I'm presenting this paper is that it shows there is a continuum between memory-based solutions and regularisation-based solutions; they're really doing the same kind of thing, and you can interpolate between them. That's also fairly obvious when you look at the multitask objective and say: I'm just approximating it with a few data points. What this method says is: I can only store, say, 5 examples; instead of picking those 5 examples from the training set of the previous task, let's do gradient descent on the examples themselves and learn examples that best represent that loss. And it turns out you can compress your data much more if you do that. The plot here is pointing out that this idea echoes Gaussian processes, where you have something called inducing points, which play a similar role; but that doesn't really matter. The mechanism is just this: you take your model and you learn your inputs such that the loss behaves the same. You have your normal model trained on all the data from the previous task, and now you learn synthetic inputs such that the loss (or the gradients) on the synthetic data matches the loss on the real data, and you backpropagate all the way into the inputs to align them; there's a small sketch of one way to set this up after this paragraph. What usually comes out of it are synthetic inputs that look nonsensical: if you do this on MNIST, for example, you get images that look like structured noise, something very weird. But if you train on them, you get a model very similar to the one you would have had with the real data. It's just a way of compressing the data. And because these data points are learned, you can think of them as parameters, so in some sense it becomes more like a regularisation thing.

Question: which of these works best, empirically speaking? Empirically, usually the replay-style methods are the ones that work best. If you ask around, people will generally tell you that replay is what works best. They're also usually the easiest to implement: you just store some data, and then when you're learning task two, you add to your minibatch a few examples from the previous tasks. You don't need to change anything else in your code; they're very easy to implement, and they usually work the best. The issue is that the more tasks you see, the more data you have to include in every update: you have your minibatch from the current task, then a minibatch from the previous task, a minibatch from the task before that, and so on.
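Here is a toy sketch of learning synthetic examples. It is one simple way to set the idea up, via gradient matching at the current parameters; the exact recipes in the papers differ, so treat the details and names here as my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))

# Real task-A data (random stand-in), plus a tiny learnable synthetic set: 5 points per class.
x_real, y_real = torch.randn(1000, 20), torch.randint(0, 10, (1000,))
y_syn = torch.arange(10).repeat_interleave(5)
x_syn = torch.randn(len(y_syn), 20, requires_grad=True)
opt_syn = torch.optim.Adam([x_syn], lr=0.1)

def flat_grads(loss, create_graph=False):
    g = torch.autograd.grad(loss, model.parameters(), create_graph=create_graph)
    return torch.cat([gi.reshape(-1) for gi in g])

for step in range(200):
    g_real = flat_grads(F.cross_entropy(model(x_real), y_real)).detach()
    g_syn = flat_grads(F.cross_entropy(model(x_syn), y_syn), create_graph=True)
    loss = F.mse_loss(g_syn, g_real)   # make the synthetic set induce the same gradient
    opt_syn.zero_grad()
    loss.backward()                    # backprop all the way into the synthetic inputs
    opt_syn.step()

# x_syn (with y_syn) can now stand in for task A inside a replay-style objective.
```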
Question: empirically speaking, which of these actually works best? Empirically, the replay methods are usually the ones that work best. If you ask around, people will generally tell you replay works best. They're also usually the easiest to implement: you just store some data, and when you're learning task two you add to each minibatch a few examples from the previous tasks; you don't need to change anything else in your code. Very easy to implement, and it usually works the best. The issue is that the more tasks you see, the more data you have to include in every update: you have your minibatch from the current task, plus a minibatch from the previous task, one from the task before that, and so on, so the number of data points per update just keeps increasing with the number of tasks. But replay is still usually considered the best-performing family. And the reason it's considered better than the others is that it doesn't make the second-order, Taylor-expansion-style approximation: here you approximate the loss by subsampling the data, which is not the same thing as a Taylor approximation that is only valid in a small neighbourhood around your previous solution.
Any questions so far? Question: "I don't really get the replay idea — taking data from the buffer, how does that help?" OK, so the core thing is the multitask objective. You have task one, with its loss L1(θ, D1) on dataset D1, and you train θ on it. Now, when you go to the second task, you're not supposed to have access to D1 anymore — that dataset disappears; you're not allowed to keep it. So instead of keeping it, you say: I'm going to subsample it, pick a subset of examples, and build a buffer B1 that is much smaller than the dataset. For example, if D1 has 50,000 data points, I pick 500 of them i.i.d. Now, what I would really like to optimise is the multitask objective L1(θ, D1) + L2(θ, D2) — that's what I actually want, to learn the two tasks together — but I don't have D1 anymore, so I just replace it by the buffer: I use only those 500 examples and I assume that the expected gradient on those 500 is the same as the expected gradient on the entire dataset. That is the assumption. Obviously — for those of you who've done statistics — the fewer samples you have, the higher the variance. The gradient on the small set is an unbiased estimate of the gradient on the full dataset, so in expectation they point in the same direction. But you are not resampling: you sample the buffer once and keep it fixed. Because of that there will be some bias, and because the buffer is small the variance is high, so you might end up quite far off. That's the issue: you need to pick those data points wisely, because if you don't, the buffer loss can be quite different from the full-data loss. If you were allowed to resample at every step, this would be a non-problem — it would just be minibatch SGD at that point — but you're not. It's as if you picked one minibatch and that's going to be your minibatch forever.
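A minimal sketch of that recipe, assuming a fixed buffer of 500 task-1 examples and a standard PyTorch training loop; the replay batch size and other details are illustrative.

```python
# Sketch of experience replay while training on task 2.
import torch
import torch.nn.functional as F

def train_task2_with_replay(model, task2_loader, buffer_x, buffer_y,
                            replay_batch=32, epochs=1, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x2, y2 in task2_loader:
            # Sample a small replay minibatch from the frozen task-1 buffer.
            idx = torch.randint(0, buffer_x.shape[0], (replay_batch,))
            x = torch.cat([x2, buffer_x[idx]])
            y = torch.cat([y2, buffer_y[idx]])
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()   # approximates L1(θ, B1) + L2(θ, D2)
            opt.step()
```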
But yeah, the intuition is that you're approximating the old dataset by a smaller dataset that you subsample; it's just yet another way of approximating the same objective.
Question: "Can we say that replay turns continual learning into i.i.d. learning?" Yes, absolutely — that is exactly what it's trying to do. And the regularization methods are doing the same thing, just with a different kind of approximation. All of them target this multitask objective written over all your data, which is just i.i.d. learning over everything seen so far. You're always trying to approximate i.i.d. learning by coming up with a cheaper version of the correct thing. Replay does that directly; and if you do replay with infinite memory and i.i.d. sampling from that memory, you get exactly the correct thing back. That's why people typically say you need a constraint on the memory — if you're allowed to store everything, the problem becomes trivial, you're just doing multitask learning.
Question: "I don't get the difference between these techniques and multitask learning. Are we trying to approximate multitask learning, or what's going on?" Yes: these techniques are meant for the situation where you are not allowed to do multitask learning because you have to see the tasks sequentially, and all they're trying to do is approximate multitask learning. The running example is the robot in the real world: you can't do multitask learning because, as things change, you can't store everything you've experienced and you can't rewind time to go back and collect another observation. You have a stream of data, datasets that arrive one after another, and you can't have all of them at once. So you have to approximate the multitask process somehow: either you store observations from time to time, which act as a proxy for what happened in the past, or you build this regularization term. The goal of all of these methods is to simulate the multitask objective; and the point is, if you're able to do that, you can learn in a non-i.i.d., sequential setting, because you're replacing the part you no longer see with an approximation.
Question: you also mentioned methods that decide how to select the points to keep? Yes — a bunch of those are heuristics: criteria that take into account things like how long ago you've seen a data point, some rule that hopefully picks the right examples. In this particular paper there are two framings: you have a dataset, and you can either treat it as a continuous optimization problem, where you learn inputs x such that the loss stays the same, or as a discrete optimization problem, where you put a weight in front of each data point — and technically those weights should only be zero or one, because it's meant to be a selection — and you optimise over those weights, relaxing them to make the problem tractable. So the whole point is that this is the relaxation of the selection problem.
So the whole point is that instead of learning the inputs themselves, you're learning which data points to select, and you relax that selection problem into something you can actually optimise. Question: is there theory behind which points to select? No, I don't think there's a lot of theory — it's not very clear; it's mostly heuristics, basically compression-style tricks that people use.
And that brings us to the third family, the last one — the one you were discussing yesterday as well, things like progressive neural networks. There are several methods in this family, but it's really this idea: I learn task one; after I've learned it, I freeze the weights that were important for task one — here, for example, the blue weights are the ones used by task one — and from then on I learn task two only in the units that weren't claimed by task one. An early variant is maybe the cleanest one: imagine you have an MLP, you train it on task one, then you prune the units that are not being used for task one and you end up with a very small MLP. You need some way of knowing which weights are important and which aren't — usually the same kind of importance analysis as before. So you do the analysis, decide which units are important, and throw away the ones that aren't. Then, when the next task arrives, you add capacity back: you add new columns of units, initialised just as in a normal MLP. This whole new column is a new MLP that is training; the previous column is frozen — there is no change to those parameters — and all the new column is allowed to do is read the features you've learned previously. The idea is that if you previously learned a feature that is useful for this task as well, you're allowed to use it, you can read it as an input, but you're not allowed to change anything that was learned before.
What's nice about these methods is that you usually get a guarantee of literally no forgetting: everything is frozen, so nothing can change. But, in one form or another, they end up increasing the number of parameters with every task you see. The simplest form of this is just to learn a separate model for each task: then obviously there's no problem — each model sits on its own, there's no interference, no forgetting, everything is perfect. The only thing you don't get is transfer: you're learning each task independently. What progressive networks say is: well, if I have to go through these tasks sequentially anyway, I might as well take the previous models I've learned, push the new data through them to see what features they produce, and use those features as additional inputs to my new model. And the hope is that the new model can decide: this feature is already useful, I'm not going to relearn it, I'll just reuse it.
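A rough sketch of one such "column", assuming a two-layer MLP and a single frozen predecessor; layer sizes and the lateral wiring are simplified for illustration.

```python
# Sketch of a progressive-network-style column: the previous column is frozen,
# the new column learns its own weights and reads the old column's features laterally.
import torch
import torch.nn as nn

class ProgressiveColumn(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, prev_column=None):
        super().__init__()
        self.prev = prev_column
        if self.prev is not None:
            for p in self.prev.parameters():
                p.requires_grad = False              # previous column frozen: zero forgetting
        lateral = hidden if prev_column is not None else 0
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden + lateral, out_dim)  # reads previous features as extra input

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        if self.prev is not None:
            with torch.no_grad():
                h_prev = torch.relu(self.prev.fc1(x))    # reused, never updated
            h = torch.cat([h, h_prev], dim=-1)
        return self.fc2(h)
```

Usage would be something like: train `col1 = ProgressiveColumn(d, h, c)` on task one, then train `col2 = ProgressiveColumn(d, h, c, prev_column=col1)` on task two.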
That's the high-level picture. There is, of course, a lot more that people do, but this is the generic view of this modularity — or model expansion, there are different names for it.
So these are essentially your choices. You either build a regularization term that regularises the entire model; or you store the data in some form — raw examples or learned pseudo-examples; or you treat the weights in some particular way, freezing some and adding new weights as tasks arrive. Those are the three prototypical families of methods people have used for continual learning.
Question (continuing yesterday's discussion about model merging): this seems very similar to model merging — for example, when I'm merging two models, I'm trying to remove the weights not used for task A and... Yes, it is related. And actually there are also projection methods that are even closer to model merging. The idea there is that, instead of regularising, you project the gradients: you say there is some subspace you're not allowed to move in, and you project the gradient away from it — I think I have a slide on this. That is also very close to model merging, where you already have the trained models and you just need to figure out how to put them together. So there is a lot of activity there, and these projection techniques sit in between the categories. The categories aren't perfect — plenty of methods don't fit neatly into any of them — but these used to be the prototypical categories.
Question: doesn't the model grow indefinitely? Yes — with these methods the model keeps growing, and that's a problem; controlling the growth is exactly the question, and it depends on what you're doing. People usually complain if the growth is linear in the number of tasks; you'd hope for something sublinear, logarithmic maybe — the whole goal is to make the growth as slow as possible. The other thing, as I mentioned before, is that as you keep adding these frozen features that you feed in, they start acting as noise, and instead of positive forward transfer you see negative transfer. You basically start learning slower and slower, because you have a lot of features coming from the previously learned models, and if they act like noise, it becomes really hard for SGD to figure out how to learn the new task. SGD is sensitive to the signal-to-noise ratio: if the signal-to-noise ratio is low it will eventually figure things out, but convergence slows down. This is usually seen as unsatisfying, because the whole point of this family of methods is that you should get some kind of forward transfer.
Question: how does the model know which column or head to use? So, in this family of methods, at least in the usual setup, you don't have to guess:
you have different output heads for different tasks, and when a data point comes in at test time, you just pick the right output head. So you assume that at inference time you know which task you're supposed to solve, and because you know the task, you know which of these columns and heads you're supposed to use. So yes, these methods have the additional issue that you need this kind of routing information, whereas for the regularization ones you don't: it's all within one model, which has to sort out the routing implicitly when you regularise it. So this family gives you zero forgetting, but it makes life cumbersome in terms of what you're supposed to know at test time and how the model grows. The least elegant one, I guess, put that way.
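To make the routing point concrete: a minimal sketch of the multi-head setup these modular methods assume — a shared trunk with one output head per task, where the task identity has to be supplied from outside. The class and argument names are made up for illustration.

```python
# Sketch of a task-incremental multi-head model.
import torch.nn as nn

class MultiHeadNet(nn.Module):
    def __init__(self, trunk, feat_dim, classes_per_task, n_tasks):
        super().__init__()
        self.trunk = trunk
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, classes_per_task) for _ in range(n_tasks)]
        )

    def forward(self, x, task_id):
        # The routing information (task_id) must come from outside the model.
        return self.heads[task_id](self.trunk(x))
```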
Question: you said the regularization methods rest on an approximation that isn't great in some settings — so which ones are really the best? People would argue that replay is the best in those kinds of scenarios, because with replay you don't need to make this Taylor approximation. In a few slides we'll look at experiments on somewhat larger-scale datasets. What happens is: you go to these larger datasets, you run regularization-based methods and you run replay-based methods, and you find the replay methods work way better — which is what everyone expects and believes. In that paper, we tried to see why there is this gap. The natural suspicion is that the Taylor expansion breaks down, and that's why regularization can't do what it's supposed to do. But what we see is that when you run replay, you still converge to solutions that lie within the Taylor expansion of the previous task. So it's not the assumption that breaks; there's a different reason why replay is better. I'm just flagging that as a caveat: empirically, on relatively bigger datasets, the second-order approximation doesn't seem to be the problem, because replay finds pretty much the same kind of solutions; but it's also empirically true that replay works better, for reasons we don't fully understand.
So, in general: the modular methods give you no forgetting, so in terms of catastrophic forgetting they're optimal, but they have all these other costs — you need to know the task identity, the model grows. Replay, at least among these three families, is usually considered the best-performing one. It's also widely used even when it's not acknowledged as a continual learning method: for example, when you train or fine-tune LLMs you have forms of replay — you keep a portion of the pretraining data and mix it into the fine-tuning data, and so on. It's very easy to implement.
OK, still on modularity, I wanted to mention this paper — I put it in because of yesterday's discussion. This is the wake/sleep idea. In the wake phase, I add a new column, and this new column does the learning on the new task, reading features from my knowledge base, which is kept frozen — so it's like one step of a progressive network: I add a new module that is learning, the previous model is frozen, nothing can change in it, and I read out the features from the previous one. That's the wake phase. Then I go into the sleep phase, where I take the knowledge I've just acquired and compress it back into the knowledge base, and when I do that operation I'm basically using a regularization-based method. So the logic is: when I learn something new, I use the progressive-network-style technique, because that's the one most likely to give me forward transfer, so I learn the new task much faster; and then I have the sleep phase, where I'm not interacting with the world anymore, and I use regularisation to consolidate the new thing into the knowledge base without breaking what was already there. This is, loosely, inspired by the brain — wake and sleep — and so forth. And this slide is a summary of what I said before: because the features we keep feeding in act as noise, you can see, even in the original paper, negative transfer and growing parameter counts and so on; that is the problem this is trying to address.
OK, so that covers the three families of methods. One more thing, closer to what I mentioned before: the projection methods. In these methods, what you do is build a projection: before applying your update, you project your gradient onto a subspace that is different from the subspace in which the parameters are important. For example, if you compute the Hessian (or the Fisher), it will have something like a null space: there will be directions with zero or very small eigenvalues. Those are the directions you're allowed to move in, so you project your gradient onto exactly those directions where the curvature is very small.
OK, let me see if I can write this down. In regularization-based methods, the update looks like
θ_n = θ_{n−1} − η ( ∇L(θ_{n−1}) + λ F (θ_{n−1} − θ*) ),
where θ* is the previous solution and F is some importance matrix (the Fisher, say). So this is your normal gradient step plus a regularization term that pushes you back towards θ*, weighted by λ. In a projection method, instead, you do something like
θ_n = θ_{n−1} − η · M ⊙ ∇L(θ_{n−1}),
where M is a mask (a projection) applied to the gradient, and the mask is zero everywhere F is large and one where F is small. Does that make sense? So instead of adding this extra force, you're saying: this weighting F tells me there are directions I'm allowed to move in and directions I shouldn't move in; rather than adding a force that pulls me back, I simply forbid movement in those directions — I build my projection.
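A hedged sketch of that masked update, assuming a per-parameter (diagonal) importance F; the threshold value is just an illustrative hyperparameter.

```python
# Sketch of the projected update θ ← θ − η · (m ⊙ ∇L): zero out gradient components
# along directions where the diagonal importance F from the previous task is large.
import torch

def projected_sgd_step(params, fisher_diag, lr=1e-2, threshold=1e-1):
    with torch.no_grad():
        for p, f in zip(params, fisher_diag):
            if p.grad is None:
                continue
            mask = (f < threshold).float()    # 1 where F is small (free to move), 0 where large
            p -= lr * mask * p.grad
```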
The projection just says: where F is large enough, never move in those directions — then you don't need the force to pull you back. And where F is small, where you're allowed to move, you might as well not regularise at all, just let it move freely. That's what the projection method does; it's a neat trade, right? Where F is large, you used to have a strong force pulling you back; instead of that, you put a zero in the mask and simply remove that component from the gradient. You never move in directions where there would have been a strong restoring force — and if you never move, you don't need the force, because nothing will change there. And where F is very small — the directions along which you would normally move anyway — you just don't regularise at all; you let them do whatever they want. The mask doesn't have to be a hard 0/1 mask; there are different kinds of projections, but the hard projection is maybe the most intuitive: some directions you're allowed to move in, some you're not.
Yes — sorry, you were asking first. Question: "Can we say that continual learning depends on how far apart the parameter vectors for the different tasks are — if they're close to each other, continual learning is easy, and if they're far apart, you get catastrophic forgetting?" Definitely, and usually that is true: if the solutions are close to each other, you can do these Taylor expansions and the local, linearised reasoning works — you can reason about how much the function changes if you move a little. If things are very far apart, you can't linearise the system: higher-order terms dominate, and that makes it impossible to tell what you can and cannot do. So most continual learning algorithms try to find solutions that are close to each other, where these regularization terms and everything else still work.
Question (going back to generative replay): can you learn the generative model in a latent space and do the replay there? Yes — there's quite a lot of work on that side, and it helps a lot. If you project your data into a lower-dimensional, better-structured space, it's much easier to train your generative model, and doing it in latent space is more reliable. The tricky part is how you build that latent space and keep it fixed: if the latent space itself drifts, it doesn't work; if you have a good, stable latent space, it works much better. There are definitely quite a few papers on this; it's usually called latent generative replay.
And there was another question after that one. Question: how do you actually compute the mask — how do you find it?
Well, first you compute F — where F is, say, the diagonal of the Fisher, or a diagonal approximation of the Hessian — and then you decide on a threshold. If it's diagonal, you look at its entries (effectively the eigenvalues of your approximation) and pick a threshold, and that threshold is a hyperparameter. You say: anything smaller than, say, 10^−1 gets a one in the mask — I'm allowed to move there — and everything bigger than 10^−1 gets a zero — I'm not allowed to move. But this really is a hyperparameter you have to tune by hand. Computing F itself is the analytical part — for the Fisher there's a standard way: you just go through the data and accumulate it. But deciding how to project is up to you — how conservative you want to be — so you have to pick the threshold. Some people do a softer version, where they scale down the learning rate depending on the magnitude of F: not forbidding movement in a direction, just moving much, much more slowly along it. So there are softer versions of this and so forth. But there's no magic answer, no "true" threshold — you really just have to tune it. It's not that sensitive, though; usually there's a threshold that works for most cases.
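A rough sketch of both pieces just described — estimating a diagonal Fisher after task one, and the "soft" alternative that scales the step size instead of hard-masking. The number of batches and the λ scaling are assumed values for illustration.

```python
# Sketch: diagonal Fisher estimation and a soft (learning-rate-scaled) projected step.
import torch
import torch.nn.functional as F

def diagonal_fisher(model, loader, n_batches=50):
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    for i, (x, y) in enumerate(loader):
        if i >= n_batches:
            break
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for f, p in zip(fisher, model.parameters()):
            if p.grad is not None:
                f += p.grad.detach() ** 2        # squared gradients approximate the diagonal Fisher
    return [f / n_batches for f in fisher]

def soft_projected_step(model, fisher, lr=1e-2, lam=10.0):
    with torch.no_grad():
        for p, f in zip(model.parameters(), fisher):
            if p.grad is not None:
                p -= lr * p.grad / (1.0 + lam * f)   # move slowly along high-importance directions
```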
OK, I'm thinking we should take the ten-minute break around here, and then I'll get to the remaining ideas. I had another slide here, on flat minima — I'm not sure how much detail to go into, but generally speaking, what it's trying to say is: if, when learning task one, you end up in a flatter minimum, then even if you do nothing special afterwards, things are going to be less bad. That's simply because, if you're in a flat minimum, you can move quite far from where you are without affecting the loss. So you can help continual learning not necessarily by focusing on the interaction between task one and task two: you can just look at task one on its own, ignore whatever tasks might come afterwards, and try to find as flat a minimum as possible for task one. That alone significantly reduces the amount of forgetting you get down the line. That's the high-level idea; I don't think I have more slides on it.
Another thing worth talking about: it turns out that the choice of architecture, or even the choice of optimizer, can sometimes have an equal or bigger impact on forgetting than the algorithm, and that's what some of this other work describes. For example, we were looking at CIFAR-style benchmarks, if I remember correctly, and at the time, by convention, everyone used a ResNet. That was the convention in the field: you pick the benchmark, and then you put your continual learning algorithm of choice on top — a replay method, EWC or whatever regularization-based method you like. And usually people compare things that way. In this work, what was done instead is to check: what happens if I replace the ResNet with a VGG network, an older architecture without skip connections? I haven't changed anything else — I'm still just doing plain sequential fine-tuning, no continual learning method at all. ResNet might be the better architecture in absolute terms, but VGG works comparably well on this data, so it's a fair comparison. And what we saw is that VGG forgets a lot less. So just by changing the architecture you do about as well as adding one of these fancy algorithms on top of the ResNet: instead of adding the algorithm, you pick a different architecture and it does roughly the same thing. And this has to do with the kind of solutions different architectures find. That's the point I'm trying to make here, in case it isn't on a later slide.
Another super interesting result we had was with replay — but replay where you have infinite memory. Imagine you train on task one, and then when you go to task two you have access to the entire dataset of task one and you sample from it i.i.d. — a really big buffer. So you're basically doing multitask learning, just presented sequentially: first task one alone, then task one plus task two, and so forth. This works extremely well, obviously. But what we wanted to know is whether it pushes you outside the region where the second-order Taylor expansion around the task-one solution is valid. And the outcome of these experiments — even after quite a few tasks in a row — was very surprising: you take the Taylor expansion at your task-one solution, then you look at where you end up after training on everything, and you check whether the Taylor expansion still holds — how large its approximation error is. The approximation error is super low. We found this extremely interesting, because the main argument in the field was that replay and multitask training work so well because they don't need to make this Taylor expansion — and that argument appears to be wrong.
So one interesting outcome is: well then, what is going on here? We obviously don't have a final answer, but I can share some observations. First of all, when you use regularization-based methods, the Taylor expansion is not the only approximation you make: you also approximate the Hessian — you use the Fisher, you assume it's diagonal, you drop terms here and there. So one possibility is that all these extra approximations made along the way are the problematic ones, and if you could use the Taylor expansion exactly, maybe it would be fine. That's one possible answer. The other answer, which I actually find even more interesting, is this: you can take the regularization method and look at the directions of low curvature — the directions where F is small. What we did is compute this and ask: what is the subspace in which my regularization method is allowed to move? Say we do these projections: what is my mask, which directions am I actually moving in, and which directions am I forbidden from moving in?
And then you look at what the replay method does and ask: is it moving in the same directions this mask tells me I'm supposed to move in? And it turns out that it is not — they disagree. So, if this is my mask: these are the directions of high curvature — directions you're not allowed to move in — and these are the directions of low curvature, where my regularization method says you're allowed to move. Each of these is an axis, one direction — I don't know how well you can see what I'm trying to draw here — but this is the subspace you're allowed to move in. It turns out the replay method uses a smaller subspace: this sub-region is what replay actually uses.
One thing I found really interesting in these experiments is that there seem to be some directions where your Hessian says "you can move here as much as you want, nothing will happen," but as soon as you actually start moving, because of higher-order terms, the Hessian suddenly becomes large and things change. And there are other directions that also have low curvature but are stable: you can move along them for a while and nothing happens. Somehow replay manages to separate these two. If you only look at the Hessian, you can't distinguish between them — they're all just directions of low curvature. You could distinguish them by computing third-order information, because the third order tells you how fast the Hessian is changing — each order basically tells you how fast the previous one changes. So third-order information might let you split your low-eigenvalue directions into the ones that are stable and the ones that are not. Anyway, I'm just bringing it up; it's a research direction that some students I work with are looking at right now.
Question: how is replay able to detect this without ever computing the Hessian? It's not even trying to. With replay, it's almost like you take a step, and then the gradient coming from the replayed data pulls you back along every direction that increases the loss on the old task. So you take a step in all of these directions, but the gradient from the replay pulls you back on exactly the ones that hurt — even if the estimate isn't perfect. And, compared to the regularization method: the regularization method could do the same, if you recomputed F. Typically what we do is compute this F — the Fisher or whatever — once, right after learning task one, and keep it fixed; we never update it. If you recomputed it from time to time, you'd notice "actually, now that I've moved a bit, this isn't correct anymore," and that would be another way of fixing the problem. But people will say you shouldn't be allowed to recompute it, because recomputing means you need access to the data from task one, and the whole point was that you don't have access to it anymore. If you have to keep data around to recompute it from time to time, you're basically back to doing replay.
But I just think this is an interesting observation — maybe not that surprising in hindsight — that there are directions that are not sensitive for your loss, some of which are stable and some of which are not, and that this could be part of why replay works better than regularization, rather than the whole Taylor-expansion story people usually bring up.
So, here are some results from the paper I mentioned. There are multiple axes. I told you about the VGG versus ResNet comparison. Here we do something else: we make the network wider, or we make it deeper. And you observe that making the network deeper doesn't help at all, whereas if you make it wider — just increasing the width, without applying any continual learning algorithm — you get less and less forgetting. The wider the model, the less forgetting. Furthermore — this is just an empirical observation — it comes with some very intriguing side behaviours: the gradients of the different tasks become more and more orthogonal to each other. Maybe that's not surprising, because as you make the model wider you're in higher dimensions, and random vectors in high dimensions tend to be nearly orthogonal to each other. But the effect is that as you make the model wider you get less interference, because each task effectively uses a different part of the model. The other thing I find almost more interesting is that the gradients become sparse: if you look at the distribution — a histogram of the magnitudes of the gradient entries — they concentrate around zero, the more so the wider the model. And this panel is the different-architectures result, with the ResNet and the VGG; I don't remember exactly which curve is which here, but you can go to the paper, both sets of results are there.
The point of all of these papers is that in the field we typically focus a lot on the algorithm, and we come up with all these clever algorithms, but the architecture itself matters a lot and influences forgetting a lot — and width in particular matters, partly because of this orthogonality and sparsity. So you can change the behaviour of the continual learning problem just through the architecture, something that had been largely ignored by the field prior to these works.
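As a concrete footnote to the width observation above, here is a small diagnostic sketch — not from the paper, with assumed thresholds — for measuring how orthogonal and how sparse the gradients of two tasks are.

```python
# Sketch: cosine similarity and near-zero fraction of per-task gradients.
import torch
import torch.nn.functional as F

def flat_grad(model, x, y):
    model.zero_grad()
    F.cross_entropy(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])

def gradient_stats(model, batch1, batch2, eps=1e-6):
    g1, g2 = flat_grad(model, *batch1), flat_grad(model, *batch2)
    cosine = torch.dot(g1, g2) / (g1.norm() * g2.norm())   # near 0 => task gradients are orthogonal
    sparsity = (g1.abs() < eps).float().mean()             # fraction of near-zero entries
    return cosine.item(), sparsity.item()
```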
OK, I can skip this slide — the one interesting thing on it is what I already mentioned about stable and unstable subspaces. So, I want to go through a little example. When I started talking about continual learning, I was trying to make this argument that if we solved continual learning, we might also get more efficient learning in general, because training itself is full of non-stationarity; so if you're able to deal with that properly, then even if you don't nominally have a sequential problem — even if everything is i.i.d. — you might still squeeze something out. And I just want to show you, very briefly, something we tried. It did not work out — it still doesn't work — but it shows that the problem is not as easy as I make it sound. This also connects a little bit to curriculum learning and to this idea of feeding the data in the right order.
There's a lot of text here, but it's based on a simple observation from deep learning practice: as you train a model, it tends to know more about the last things it was trained on. This kind of observation goes back to early language models trained on Wikipedia: if you ask about articles seen late in training, the model answers much better; if you ask about articles seen early, it does noticeably worse. Similar artifacts have been reported in other papers and other settings. So it seems that these optimisers generally have a bias towards overfitting the most recent data.
So here is the algorithm we tried; it's actually quite simple. You say: this is my dataset, and as I start learning, I'm going to regularly look at the per-example loss and classify my data into easy and hard examples based on that loss. I pick a threshold and say: every example whose loss is below the threshold is "solved," everything above it is "unsolved." Once I've done this, after the first pass, I think of it as two tasks. Task one is the set of examples I'm already able to solve — the task I have, in effect, already learned. Task two is the set of examples I have not solved yet. So how about I treat this as a continual learning problem, where from now on I only train on task two, the hard examples, and I add some kind of regularization so as not to forget task one, the easy part? The idea is that I progressively focus on the hard examples, which should become a smaller and smaller set, and that I save a lot of compute, because I replace the gradient updates I would have kept doing on the easy data with a regularization term instead. That's the high-level idea.
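A sketch of that (unsuccessful) idea, with assumed details: split the data by per-example loss, keep training only on the "hard" split, and regularise toward the old weights so the "easy" split is not forgotten. Threshold, λ, and step counts are illustrative.

```python
# Sketch: loss-based easy/hard split treated as a two-task continual learning problem.
import torch
import torch.nn.functional as F

def split_by_loss(model, x, y, threshold=0.1):
    with torch.no_grad():
        losses = F.cross_entropy(model(x), y, reduction="none")
    hard = losses > threshold
    return (x[~hard], y[~hard]), (x[hard], y[hard])        # (easy / "solved", hard / "unsolved")

def train_on_hard(model, hard_x, hard_y, steps=100, lr=1e-3, lam=1.0):
    anchor = [p.detach().clone() for p in model.parameters()]   # θ* from the easy phase
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(hard_x), hard_y)
        reg = sum(((p - a) ** 2).sum() for p, a in zip(model.parameters(), anchor))
        (loss + lam * reg).backward()                        # "don't forget the easy part"
        opt.step()
```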
And it did not work — and the reasons it didn't work are kind of interesting. I removed the slides I had with the details because they went into way too much depth, so let me just explain why it fails. First of all, to be honest, we don't fully know why it doesn't work — intuitively it makes a lot of sense, and the idea isn't even new: there are multiple papers trying to do essentially the same thing; we tried to replicate some of them and they don't really work either, and our version didn't quite work either. But I can tell you some hypotheses.
One hypothesis is this: if you look at a data point and only check whether its loss is low, that doesn't mean the data point is actually solved. A model can classify a picture as a car just because the sky is blue — it can rely on spurious correlations. Say the majority of car images in your dataset have a blue background. Then the first thing the model might learn is that anything with blue in the background is a car. Obviously this will misclassify some examples where the correlation doesn't hold, but all the cars with blue backgrounds will look like solved examples. So you end up removing from your dataset all the car examples with blue backgrounds, because you assume you've already learned that part of the data — when in fact you have not built the right features to detect a car at all; you're detecting something much shallower. So there is this issue: the loss is not indicative of whether the model has learned the right representation to deal with that data going forward. That's one hypothesis.
The other reason — and I think, honestly, this is the main one — is that the examples flagged as hard are often outliers, even mislabelled data points. If you do this at large scale, the hard set fills up with outliers, and because you're now focusing so much on the outliers, you actually harm the rest. A normal learner that simply trains on the entire dataset will, at some point, also start fitting the outliers, but by singling them out and giving them all of the gradient budget you're effectively upweighting them, and the model starts paying far more attention to the outliers than to the mode of the distribution — the bulk of the images — and performance gets worse. So that's a second potential explanation.
Anyway, what I'm trying to say is that this was an attempt to construct a curriculum automatically, by filtering the data as you go through learning. I was very excited about it — this was research I was doing last year; I thought we'd crack it — but once you get into the details, it isn't more efficient; it just doesn't work. With this scheme, even implemented carefully, by the time you converge and reach reasonable performance you end up using as much compute as if you hadn't done any of it. The reason is either that once you zoom in on the outliers you need a lot of updates to figure out how to deal with them, or that you get these cycles where you "protect" things that were never actually learned correctly, and at some point you have to realise that and pull them back into the hard set, and so on. But anyway, I just wanted to give you a more recent, research-flavoured example of how you can use these concepts from continual learning. The thing I liked about this project is that it happened at a point where, in the community, there was this talk that catastrophic forgetting is not useful to study, that we're working too much on it and should work on other things — and this work is all about catastrophic forgetting. That's why I was saying that controlling forgetting is useful: it might help you learn very efficiently. But yeah, my pitch did not work out, so in the end it's not a great advert.
We still have 15 minutes, so let's see if we can go through a couple more slides. So far this has all been continual learning in the classic sense; now I want to say something about this idea of wanting to learn faster and faster. We've talked about catastrophic forgetting; this slide is probably badly done because it doesn't give much context, but what I'm trying to get at is that besides catastrophic forgetting, the other thing we care about is forward transfer. The reason forward transfer is much, much harder than catastrophic forgetting is, I believe, that forward transfer has something to do with compositionality — as opposed to, say, in-domain generalization or interpolation. When we say that I learned task A and I want to solve task B faster, it usually means that solving B involves re-composing features that I learned on task A. I think we touched on compositionality in one of the earlier lectures, but I just want to say that there is a difference between compositionality and interpolation, and neural networks, as far as we know, do not give you compositionality for free. That is, in some sense, quite fundamental, because anything that has to do with learning fast, adapting fast, will — I believe — at the end of the day need some form of compositionality.
This plot — it happens to be from RL, but that doesn't matter — kind of shows why compositionality is hard. Let me see which task this was; I think it's the corridor one. In this task you have to walk down a corridor until you find the apple; when you find the apple you get some reward, and then you have to go back home. So you have to go forward, and you have to come back. What we do is decompose this into separate tasks. In the first task, you just have to go forward and find the apple; you find the apple, you get your reward, the episode ends; you train your agent on this and it gets very good reward, essentially solving it. Now, to create the compositional task, the second task is: you have to go all the way there — exactly the part you've already trained on and know how to do — and then walk back to the start, and only then do you get your reward. And when you start fine-tuning on this second task, the first thing the model does is unlearn how to get to the apple. That is literally the first thing that happens: you forget your previous skill and you basically start from scratch. The actual setup is a bit more complicated, but the whole point — there's a paper on this, which you don't have to read — is that when you're stitching skills together, stitching tasks together, even if you're not presenting them strictly sequentially — you just say "solve task A," then "solve task A followed by B," where "A followed by B" means that within one episode you do A and after that you do B — learning has this pathology: the first thing it does when it starts learning the part of the task it doesn't know yet
is break the part it already knew. If you have to do two things one after the other, when you start learning the second thing — because everything is shared in the parametrization — as soon as you start taking gradient steps on the second thing, the first thing that happens is that you damage the first skill. You cannot learn new things without messing up the representations you have and the behaviour you have. And what this work shows is that if you add some kind of EWC, or some kind of replay, or whatnot, it actually helps a lot in these stitching scenarios, because there is a real form of forgetting happening even within what is nominally a single curriculum. Anyway — I hope the figure makes the point — forgetting happens everywhere when you're trying to learn, and it's a problem even when all you're trying to do is compose skills.
And maybe the other point I want to bring up — I don't remember exactly where I was going with it — is why this happens: the updates we do are not localised. Whenever you learn in a neural network, every update is global: it touches, in principle, every parameter of the network. Because of that, it's really hard to learn new things without destroying what you had before, because you're effectively injecting noise into the entire network. What would make this work is if learning were local: when I learn something, I only affect the small number of parameters needed to solve that thing. In my view, this is a fairly fundamental issue with the current learning approach: everything is global. We've slowly started moving away from that, but everything is still global in the learning process of a neural network, and that is what creates this pattern of catastrophic forgetting.
Question: why does it have to forget — why can't it just keep the old skill? Yeah — this experiment is exactly meant to show that this is what happens; it's meant as an illustration that this will happen. Basically, if you learn phase one and then learn phase two, because of the forgetting it takes you as long to learn phase two as it would have taken to just learn the full task from the beginning. So our point was: there is no point in pre-learning skills and adding them to the model, because when you need to learn a new skill that composes with them, you'll automatically forget them anyway — so you might as well learn the composed behaviour directly. And that is basically the whole point of why you can't adapt fast. The way a biological system would do it is to learn these things separately and then compose them on the fly, solving the new task almost immediately. Neural networks can't compose things that way; they have to learn the whole composed solution at once, in the same weights; they can't just learn parts and then assemble them into something different. That's the argument we're trying to make.
Question: that's for SGD, but what if I move away from that? Sorry — move away from what? From plain gradient descent — by using these methods you described, or something else? No — so, methods like the projections and whatnot,
those were designed to make learning a bit more localised, in a sense, because they project away parts of the gradient. But I'm just saying: if you just run Adam, which is what people use here, there is no localisation anywhere. You could hope that the gradient naturally becomes sparse. But, first, that's not necessarily going to happen, and second, as I said, even though the gradient tends to become sparser the wider the model — because the objective itself only touches certain parts — the momentum mixes things up: even if each individual gradient is eventually sparse, if they're sparse in different places at different steps, the accumulated momentum means the actual update touches a dense set of parameters anyway. You could argue that maybe we should drop momentum, but momentum has its benefits — it makes everything train much faster, and some people would argue it's one of the things that helps most in practice. So there are pros and cons. But generally speaking, your updates are not sparse: even if the gradients are somewhat sparse, you don't have explicit control over that sparsity, and it's nowhere near as sparse as you would really need. So I think an interesting question — and this is actually part of your homework — is: what happens if you sparsify your updates? I think it's one of the assignment questions: something fairly toy-ish, where you sparsify the updates and see what happens to forgetting.
OK, I guess we have five minutes; let me see what the next slides are. I'm going to go over this slide because it's something we're going to discuss later — again, there's a lot of text, and I'm partly repeating what I said before. The one issue we have in the field — the elephant in the room — is that all of these effects we're talking about — interference, catastrophic forgetting, forward transfer and so forth — are all dependent on the data. If your tasks are related to each other, then you can see forward transfer, you can do research on forward transfer, and you can try to argue why something works or doesn't. If your tasks are completely unrelated to each other, there isn't going to be any composition, there isn't going to be any forward transfer; it doesn't even make sense to ask for it. So the properties of the data matter substantially more than we tend to acknowledge. And this slide is mostly a plea: we do not have the tools to talk about this structure in the data, and it's not a priority for the field right now, which I think is a problem. It shouldn't be enough to publish a benchmark and say "this is the new benchmark everyone should run their continual learning algorithms on" without specifying why — what is the structure in this benchmark, what are you trying to highlight. Right now it mostly just produces pretty pictures: everyone goes and builds a new benchmark, they look flashy, they give you a lot of tasks to run, they all do different things, but no one is saying: this is the structure I'm trying to expose here,
I'm trying to show that you can, or cannot, take advantage of this or that structure, and so forth. And maybe the last point I wanted to make is that this usually doesn't happen also because we don't have a good mathematical language for it. The problem runs deep, in the sense that we don't have enough research on similarity metrics between tasks — on what the right framework even is. There are some ideas people are using. For example, in some corners of the field there is a lot of talk about symmetries, about groups and so on; that is potentially one language. Another language being used a lot now is compositionality: the shared structure is how things compose, how they are factored in the environment. But none of these languages is, I think, developed enough or exploited enough, and it's not clear to me which is the right one to describe this kind of structure. You could also work with Taylor expansions and talk about invariances and things like that; that could be another option. I just don't know which is the right one.
OK, I know we still have a couple of minutes, but I'll try to stop here, otherwise I'm just going to keep going forever. I don't know if you have any questions — we have a few minutes — and if not, we'll see each other in the afternoon.
Question: I don't understand why wider networks have a larger parameter space. Sorry — a larger parameter space than what? Compared to deep networks. Oh, you mean why the gradients are more orthogonal? No — why the parameter space is bigger; you mentioned something like this in the first few slides. OK, maybe I didn't express myself correctly. The size of the parameter space is the same. What I was trying to say is that if you make the model wider, the gradients become more orthogonal to each other; if you make it deeper, they do not. The dimensionality is the same. The bottom line was: if you make the network wider, you get less forgetting; if you make it deeper, you get the same amount of forgetting. Question: was this only observed empirically, or is there a theoretical justification? The justification is that when you add width, the way the space grows is less restrictive, because units that sit in parallel don't constrain each other. If you add depth, you get all of these symmetries — remember the linear regions, they're tied to each other — and even though you have the same number of parameters, the effective degrees of freedom are fewer. Let me put it differently: the parameter space grows at the same rate — you have the same number of dimensions — but the degrees of freedom you have in a very deep model are far smaller than in a shallow, wide one, because all of these symmetries tie parameters together and constrain where the gradient can go. And a side effect is that with depth the gradients do not naturally become more orthogonal, and you don't get the sparsity. So it may simply correlate strongly with the degrees of freedom you have. Anything else? No? OK, then that's it for now.