And following the literature, there is this thing about finding preconditioning matrices, right? So you pick a preconditioner P, and it makes your learning faster, and there are some choices of P that make things go much faster. In particular, the one we're gonna talk about

is natural gradient, which, in the end, Adam is very related to. So this natural gradient, I don't even have the citations here, but the natural gradient algorithm comes from Amari. It's a very interesting paper if you're ever bored and want to read a paper.

So, Amari started working on natural gradients in the 60s. I think the natural gradient paper for deep networks is from the 80s or 90s. It's actually a hard paper to read. Amari, Shun'ichi Amari, is a Japanese researcher, and until recently, I would say the Japanese community was kind of isolated from the rest of the machine learning community, the US and European community, and you can see that in the paper. It was the most painful paper that I ever read.

Because the notations are completely different. He uses Einstein summation, which I wasn't used to. He calls the biases a instead of b, and the weights are called, I forgot what; no notation makes sense. Nothing of the standard notation holds, and the way he uses the activation functions is weird. But the paper basically introduces this concept, and at a high level the intuition is: we're doing the usual thing, where we take a linearization, a first-order Taylor expansion of our loss, and we want to minimize it, and we want to create some trust region, some constraint, so that we don't move too fast, right?

So that the approximation holds. But we want that constraint to be in function space. So his point is, and this is something that's easy to notice, that there are many values of theta that give you the same function, so your model is not parameterized in a unique way.

So, for example, one thing you can think of is you take two neurons and you swap them around. Technically, you change nothing, because if you swap the incoming weights and you swap the outgoing weights, you get exactly the same function. But if you look at the weight matrices, the weight matrices have swapped columns, so it's technically a different theta, right?

So you've landed at two different points in parameter space. His point is that the mapping between parameters and functions is not necessarily one-to-one, and even if it is one-to-one, sometimes a change in theta creates a big change in your functional behavior.

Sometimes it doesn't, right? So there is a metric there that says that functions don't change the same way parameters change. So all he wants to do is, instead of saying, I want to take a step that is small enough that my theta doesn't change by more than epsilon,

he says, I want to take a step small enough that my function doesn't change by more than epsilon. That's the big idea. So then the next question is: okay, if I want a distance between functions, how do I define a distance between functions? And that's a very hard question on its own. So the choice that Amari made was saying, well,

we know in neural networks that the output can always be interpreted as a probability, and it goes back to all of those things about probabilities, so we can always think of the output of a neural network as corresponding to some kind of distribution. And we know how to compute distances between distributions.

And this is what we're gonna do. So p(z | θ) is basically the distribution that comes out of your model, and we can use the KL, which is a divergence. It's not a distance, but it's good enough and everyone uses it. So I'm going to use the KL to measure how much my output distribution changes if I change theta. Yes, question?

[Student: is it a proper distance? What does it mean in this context?] It means that if I compute the KL of p(z | θ) against p(z | θ') and I reverse the order of the two terms, I don't get the same thing. It's not symmetric. That's basically the main thing that's missing for it to be a proper distance.

It's not symmetric, but otherwise, you know, it's zero when the two distributions are equal, and it's non-negative. The triangle inequality, I don't think that one holds either; anyway, there's a bunch of properties that can fail. [Student: so it measures how different they are?] Yeah, yeah, it measures a kind of difference between the distributions.

It's just not symmetric, and that's kind of important. That also means that if I reverse the order here, I'm going to get a different algorithm, because the order matters when you do your expansion. Okay, so this is sort of what he was going for. Let me try to skip ahead a bit.
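
To make the asymmetry concrete, here's a minimal sketch (plain NumPy, with made-up distributions) of the KL between two categorical distributions, computed in both directions:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) between two categorical distributions (arrays that sum to 1)."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q), kl(q, p))  # two different numbers: the KL is not symmetric
print(kl(p, p))            # 0.0: it is zero when the distributions are equal
```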

So now the question is: okay, we've done the first step, we've decided how to make our trust region. But obviously this term is nasty, right? I have this KL term and I don't know what to do with it. So the next step in this process is, okay,

I want to replace this constraint with a constraint that's a lot more pragmatic, that I can actually use. So the way he does this is, he takes a second-order Taylor expansion of the KL and then plugs in that second-order expansion instead of the KL itself.

And the reason for that is that once you take the second-order expansion of the KL, it turns out that it simplifies a lot. So if you write it down, and again, I don't want to go through all the math, I'm just going to say a few words and you can look over it whenever you want:

if you write the formula of the KL, here I wrote it as the sum over z of p(z | θ) times the log of the ratio, so log p(z | θ) minus log p(z | θ + Δθ), and then you convert this into an expectation because it's easier.

And then you start doing the Taylor expansion around θ of this term, because that's the one that contains the Δθ. What happens, and you just have to trust me, is that the zeroth-order term in that expansion disappears, because you get log p(z | θ) minus log p(z | θ), so that's zero.

So it cancels out. And then the first-order term of the Taylor expansion, the one that's just in terms of the gradients, disappears as well. The reason there is a bit more technical. Let me see if I have it. Sorry, no, I don't have the math right here.

The reason is that the expectation and the derivative are both linear operations, so you can reverse their order. So what you can do is push the expectation inside, and you take the gradient of the expectation of this first-order term; when you push the expectation inside, you get a one, so you get the derivative of 1 with respect to something, which is zero.

Anyway, you can look this up. But technically, because of the expectation, the first-order term disappears as well, and the only term that is left is the second-order term, which is Δθ transposed, times the second-order derivative of log p(z | θ), times Δθ. And this is kind of nice, because of what comes next.
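
To collect the steps he just described in one place, this is the expansion in symbols (a sketch, assuming the usual regularity conditions that let you swap the derivative and the integral):

```latex
% Second-order expansion of the KL around theta
\mathrm{KL}\big(p(z\mid\theta)\,\|\,p(z\mid\theta+\Delta\theta)\big)
  = \mathbb{E}_{z\sim p(\cdot\mid\theta)}\big[\log p(z\mid\theta)-\log p(z\mid\theta+\Delta\theta)\big]
  \approx \tfrac{1}{2}\,\Delta\theta^{\top} F\,\Delta\theta,
\qquad
F = -\,\mathbb{E}_{z\sim p(\cdot\mid\theta)}\big[\nabla_\theta^2 \log p(z\mid\theta)\big].
% The zeroth-order term cancels; the first-order term vanishes because
\mathbb{E}_{z\sim p(\cdot\mid\theta)}\big[\nabla_\theta \log p(z\mid\theta)\big]
  = \int \nabla_\theta\, p(z\mid\theta)\,dz
  = \nabla_\theta \int p(z\mid\theta)\,dz
  = \nabla_\theta 1 = 0.
```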

So the second-order derivative of log p is going to be your metric. Yeah, so if I go back to the constraint, all of the other terms disappear. And, no, I really don't have the math slide here. But basically, I end up replacing this KL by a formula that looks like the following.

Do I have it anywhere? Sorry, I don't. But you replace this KL by something that looks like Δθ transposed, times the second derivative of log p, times Δθ. So the second-order derivative of log p, which is a Hessian, would be my new matrix.

This will end up being what I'm preconditioning with.

This second-order derivative of a loss, you can always rewrite. Wait, am I taking an expectation here, or am I just... no, I'm not. You can always rewrite this to exploit the layered structure of the architecture. And you can see that, because the loss

is the loss function applied to the output of your model, you can rewrite it as a sum of terms. This is basically just using the chain rule. So you say: the first derivative of the loss, because of the chain rule, is the derivative of the loss with respect to the output, times the derivative of the output with respect to theta.

That's just the chain rule, right? That's just writing those two terms. And now, when I take the derivative of this again to get the second derivative, I have the product rule. So, because of the product rule, the derivative of the product of these two terms becomes this term plus this term, written out below.
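
In symbols, with y the network output and J = ∂y/∂θ its Jacobian, this is the standard decomposition he's pointing at (a sketch; the slide's exact notation may differ):

```latex
\nabla_\theta L = J^{\top}\frac{\partial L}{\partial y}
\quad\Longrightarrow\quad
\nabla_\theta^2 L =
\underbrace{J^{\top}\,\frac{\partial^2 L}{\partial y^2}\,J}_{\text{kept: Gauss--Newton term}}
\;+\;
\underbrace{\sum_i \frac{\partial L}{\partial y_i}\,\nabla_\theta^2\, y_i}_{\text{dropped near convergence}}
```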

And this is kind of a scheme that's used often with second-order derivatives, and the way people usually reason about it is: let's say I'm going to drop this second matrix. And the reason I'm going to drop it is that when I'm close to convergence, this loss term is going to be close to zero, so this whole term is going to be close to zero, so it doesn't matter.

And I'm gonna keep this other matrix, and this matrix is usually called the Gauss-Newton approximation of the Hessian, if everyone has heard the term. The nice thing about this term is that it has this form of Jacobian transposed times Jacobian, and you know that when you have this form, it has to be positive semi-definite.

So this is how people usually get rid of negative eigenvalues. And it turns out that when you do natural gradient... okay, so I know I put these comments here because I was going to derive all of this on the board, but I ditched that in the end. The whole point here is that,

if I'm looking at this second derivative of log p(z | θ), which is the term we take the expectation over, then, without working too much through it, I can say that this can be replaced, in expectation, by the derivative of log p

times the derivative of log p, with a transpose somewhere; I'll put it on the second one, I'm not sure on the board if that's the right place, but basically you can do the expansion and you get these two things. [Student: what about the Δθ?] Sorry, what about the Δθ? So the Δθ will come back in front of this.

For now, I just took this middle term and wanted to expand it. The reason I'm trying to do this is because I want to prove, well, I'll send it to everyone, I'll update the slide with a link. Actually, if you look up this paper that I'm an author of, Revisiting Natural Gradient,

in the paper, you have the step-by-step math in the appendix, if you really want to go through it, the step-by-step of how you go from here to there. But for here, I just want to give you the intuition. So the big change of natural gradient is: you replace this with this functional distance, and at the end of it, when you take the Taylor expansion,

you end up with a Hessian again. So one question you could have is: what have I solved? I had a Hessian before. I did not like it, because it had negative curvature. I went through all of this exercise to change it to this distance and whatnot, and in the end, I got another Hessian.

So was this good for anything? The tricky bit is, you can try to expand that Hessian, and you can exploit the fact that this is an expectation of that Hessian. And it turns out that if you do that, you get that this thing is equivalent to an expectation over the outer products of gradients.

So the advantage of this form is that it is positive semi-definite by construction. It doesn't have to be positive definite, but it's at least positive semi-definite by construction. You can just take my word for it, or you can look at any linear algebra book, but any matrix that has the form A times A transposed, or A transposed

times A, I don't remember which order applies here, is positive semi-definite by construction; you can prove that. [Student: so, to check my understanding, you solved the problem of the Hessian having very large negative eigenvalues exactly using this?] Yes, because this matrix does not have negative eigenvalues.
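
Here's a minimal numerical check of both claims (NumPy, using a categorical distribution parameterized by softmax logits, where everything is available in closed form): the expected outer product of score vectors equals minus the expected Hessian of log p, and it has no negative eigenvalues.

```python
import numpy as np

theta = np.array([1.0, -0.5, 0.3])            # logits
p = np.exp(theta) / np.exp(theta).sum()       # p(z | theta) = softmax(theta)

# Score vectors: grad_theta log p(z) = onehot(z) - p, one row per outcome z.
scores = np.eye(3) - p
fisher = sum(p[z] * np.outer(scores[z], scores[z]) for z in range(3))  # E[g g^T]

# Hessian of log p(z) wrt theta is -(diag(p) - p p^T), independent of z here.
neg_expected_hessian = np.diag(p) - np.outer(p, p)

print(np.allclose(fisher, neg_expected_hessian))  # True: E[g g^T] = -E[Hessian of log p]
print(np.linalg.eigvalsh(fisher))                 # all >= 0 (one is ~0): PSD by construction
```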

So even though, in the end, I end up with a Hessian, this Hessian does not need regularization for negative eigenvalues. So this plot here is trying to illustrate what's going on. The red surface, I should have made it bigger, the red surface is the loss function that you are navigating. What I'm trying to show here is that once I've plugged in that KL and worked it out, I have a different surface.

And that surface is always quadratic; it doesn't have a saddle. And what I'm doing is taking a descent step on the red surface using the curvature of the green one. That's basically what we're doing when we do natural gradient: we're creating this additional function, which is the KL, and we're taking the curvature of the KL and applying it to the function we actually care about. And this makes sense because what we're really doing is solving this constrained optimization problem.

What we're saying is: we're solving our original problem, the first-order approximation of the problem, where we constrain how much the distribution is changing in a KL sense. And by doing that, we end up with this algorithm, which is a preconditioned SGD, so it has this form where the preconditioner ends up being the outer product of gradients.
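
Concretely, a minimal sketch of one such preconditioned step (NumPy; `per_sample_scores` stands for the per-sample gradient vectors, and the small damping term anticipates the zero-eigenvalue issue he gets to in a minute):

```python
import numpy as np

def natural_gradient_step(theta, loss_grad, per_sample_scores, lr=0.1, damping=1e-4):
    """theta <- theta - lr * (F + damping*I)^-1 @ loss_grad,
    with F estimated as the average outer product of per-sample score vectors."""
    n = per_sample_scores.shape[0]
    F = per_sample_scores.T @ per_sample_scores / n        # outer-product preconditioner
    precond_grad = np.linalg.solve(F + damping * np.eye(len(theta)), loss_grad)
    return theta - lr * precond_grad
```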

So then it doesn't have any negative curvature. That is the big thing about natural gradient. Yeah, so I'm not going to work this out on the board, because I feel like we've seen too much math anyway and I'm worried it's just gonna look like numbers. But the additional thing about it is: you have this particular derivation of natural gradient, which says, if I constrain the KL and I do a bunch of math, I end up with this preconditioner

that is just the outer product of gradients. And then you have this other derivation, which is not natural gradient; this is traditional second-order methods, where people use this Gauss-Newton approximation. So this is a different derivation, which says: the Hessian of the loss that I care about is equal to the

outer product of the Jacobians from the output to the parameters, times the Hessian of the loss, plus the gradient of the loss times the Hessian of the outputs. This is sort of a big thing in the optimization world; this is called the Gauss-Newton. And the reason I wanted to bring

this up, I mean, there are many reasons I put this slide in, but maybe one reason is because it's kind of interesting, to add a bit of history and make it a bit more fun, how these things evolved. So, Amari did natural gradient early on,

but then people weren't actually using it. So for a long time, people were kind of ignoring it. Then there is a paper from Yoshua's group that claims to do natural gradient but is actually not doing natural gradient. In that paper, they messed up the preconditioner and used a different matrix.

And then, in parallel, you have folks like James Martens who were using this Gauss-Newton approximation, and they were claiming that Gauss-Newton works better than natural gradient because it is connected to the Hessian and whatnot. And then, later on, there is a paper that actually shows that Gauss-Newton and natural gradient are the same thing, because it turns out that this matrix is exactly the second-order derivative of the KL.

And this is always true if you have a matching loss and activation function pair. So if you pick a reasonable loss and activation function, which you typically do even if you don't think about it, like negative log-likelihood with a softmax, or mean squared error with a linear output, then this Gauss-Newton approximation is exactly the same as Amari's Fisher.

And all of this matters because it tells you, besides the motivation that we had here, why this makes sense. This makes sense because this turns out to be an approximation of the Hessian of your true loss. Because it turns out that this KL,

that KL term, whose curvature is exactly this matrix, is related to the Hessian of your loss by this extra term, which is typically treated as being zero.

I think I lost half of you, but the whole story was really this: natural gradient is usually presented as a preconditioner, because it's not about using the Hessian of your loss but about using a different matrix; and yet it turns out that this matrix has a lot of relationship to the Hessian. So the whole point I'm trying to get to is that this outer product of gradients

is, for various reasons, a very good proxy for your curvature. And the reason this is interesting is that, if you squint at it, this is very close to the formula that Adam uses. So Adam takes the gradient squared. Now, if you assume that this matrix is diagonal, which it is not, then this would be just the gradient squared elementwise, and so the squared gradient is a proxy of that.

So this is just telling you that you can approximate the curvature of a function by looking at the square of the gradients. [Student: so Adam is a special case of this?] Yes, correct. That's what I'm trying to convince you of: they are connected, though there are some glitches on the way.

But yes, you can think of Adam that way, and this particular paper, if you want a much more formal take on it, the Bayesian learning rule paper from Emtiyaz Khan and others, is really trying to pin this down and argue that Adam is just a diagonal version of natural gradient.

That's the whole spiel of that paper. It's a bit more complicated than that, though, for a few reasons. Okay, let me go through why it's a bit more complicated. The first reason

it is more complicated is that in the Gauss-Newton approximation and in natural gradient, what you do when you're computing this matrix is you use the gradient from the output to your theta. So the Jacobian goes from y to theta, right? It's not from the loss to theta; it's from y to theta.

In Adam, you're using the gradients you're using to do your optimization: the gradients that start at the loss and go to theta. You don't use the gradients that start at y and go to theta. In the literature, this thing is called the empirical Fisher. So you have the Fisher, which is what Amari introduced, and the empirical Fisher, which starts from the loss instead of starting from y.

And the reason why the literature around natural gradient, if you ever find this topic interesting and decide to look at the papers, is a bit messy, is because people confuse these matrices all the time. You have papers that mixed up the empirical Fisher with the proper Fisher and replaced one with the other.

They're not the same object, they're different mathematical objects, but people use them interchangeably. So a first step to go from natural gradient to Adam is to replace the true Fisher by the empirical Fisher. The other thing that makes Adam a bit different is that you take the square root.

And the square root thing is a little bit of a "no one really knows, but it works really well" kind of thing. There are different ways to argue for it. One way is: if you take a matrix and approximate it by its diagonal, so you remove all the off-diagonal elements and just keep the diagonal, then

usually, if you compare the eigenvalues of the diagonal with the eigenvalues of the full matrix, you're overestimating the eigenvalues, and taking the square root pushes them back down, to correct that overestimation. This is a very hand-wavy argument, but it's something people use.

Another argument that maybe some people will like is: if you look at the formula for Adam and you assign units to the different quantities in there, it turns out that if you don't have the square root, the units don't work out. I don't know if you guys know, but in physics, people do this all the time.

If you want to figure out whether you have the right formula for something, you plug in the units and check that they come out the right way, right? If the units don't work out, you've messed up the formula; it's not the right one. So it turns out that if you work through what the units of the update should be, you need the square root to make the units work.

That's another argument I've seen for that square root, but really the answer is that the square root just helps quite a bit in practice. And okay, let me give you another, better reason why you need the square root. Another reason is that you don't use the true Fisher; you use the empirical Fisher.

So what is the difference between these things? The difference shows up at convergence. When the loss is zero, this extra term becomes zero, and the true Fisher, which is this function, becomes your Hessian. It's the same as the Hessian, so that's fine.

But if you look at the empirical Fisher, the empirical Fisher is the outer product of the gradients of the loss with respect to theta, and at convergence the gradient of the loss with respect to theta is zero. So if you square this thing, you get zero. So if you look at that formula, where you have the gradient divided by the gradient squared,

as gradients go to zero, the denominator goes to zero even faster, because you're dividing by the gradient squared. So what that means is that Adam becomes unstable when you're close to convergence if you don't have the square root, because if you look at that limit, as the gradient goes to zero,

that thing explodes, goes to infinity, because you have g over g squared. And the square root solves this. Well, it makes it better, which is good enough, it seems, for convergence. Because the rate at which g over g squared diverges, in this case it goes to infinity, because it's like one over g as g goes to zero,

is not what you get with the square root. With g over the square root of g squared, that stays constant, right? Because g over the square root of g squared is roughly one, more or less, up to some variations. So this makes things a lot more stable.
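
You can see this in a couple of lines (a toy numeric sketch, ignoring the moving averages and the epsilon):

```python
import numpy as np

# As the gradient g shrinks toward zero near convergence:
for g in [1.0, 0.1, 0.01, 0.001]:
    no_sqrt = g / g**2               # without the square root: 1/g, blows up
    with_sqrt = g / np.sqrt(g**2)    # with the square root: |g|/|g| = 1, stays bounded
    print(f"g={g:g}  g/g^2={no_sqrt:g}  g/sqrt(g^2)={with_sqrt:g}")
```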

So these have been the kinds of arguments I've seen. Okay, I'm gonna push a little bit through this. I can appreciate that this part of the lecture is a bit heavy; optimization ends up being that way. I wanted to make another note that is kind of useful,

which is: we did all of this exercise because we were worried about negative eigenvalues. But one thing that turns out is that you still need to regularize the Fisher. The Fisher is this matrix that we've been computing, this outer product we've been playing with. And the reason you still need to regularize it with the identity

is because it's positive semi-definite: you don't have negative eigenvalues, but you can have eigenvalues that equal zero. And that is still a problem, because it means you can't compute the inverse. So in practice, you still regularize this matrix to get rid of the zero eigenvalues, but at least you know the only thing you need to do is correct for zeros.

So you can add a small epsilon just to make it invertible. [Student: can you just weight it by something small?] Yeah, anything positive? Well, you need to be careful, because there is the condition number, right, which is the largest eigenvalue divided by the smallest eigenvalue.

If the condition number is very large, so if you take the spectrum of eigenvalues and look at its extremes, the largest and the smallest, and the gap between them is huge, then usually, numerically, things are not very stable when you try to do the inversion directly.

So you don't just want to add 10 to the minus 12 and say, okay, I don't have any zeros, my smallest eigenvalue is 10 to the minus 12, because then numerically your inversion might still be unstable. You want to add something big enough that the gap between the largest and the smallest eigenvalue is reasonable, so that the inversion is stable.
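
A small sketch of that effect (NumPy, with a made-up rank-deficient matrix): a tiny epsilon makes the matrix technically invertible but leaves the condition number huge; a larger damping brings it down, at the cost of more bias.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
F = A @ A.T          # rank-3 PSD matrix: two eigenvalues are exactly zero

for eps in [1e-12, 1e-4, 1e-1]:
    F_damped = F + eps * np.eye(5)
    print(f"eps={eps:g}  condition number={np.linalg.cond(F_damped):.3g}")
# tiny eps -> astronomically large condition number, inversion still ill-behaved;
# larger eps -> smaller condition number, more stable inverse
```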

But this is more in the numerical-stability kind of space. It's not anymore about dealing with negative curvature; it's really about making your computation stable when you do the inversion. The other thing I wanted to mention, when it comes to practical methods that try to do second-order optimization, is that one thing they do is use a block structure, and the reason is that people have argued that the off-diagonal blocks, the blocks that correspond to interactions between layers, are small. So, block structure:

what does it mean? It means you compute the Hessian per layer. You don't look at how one layer affects a different layer; you just say, I'm going to compute the Hessian for each layer independently. And if I'm looking at my whole Hessian, it means I assume all of the off-diagonal blocks are zero,

and I only have non-zero elements in these blocks on the diagonal. This is just a trick to save memory and compute, and the justification is that people have argued that the off-diagonal blocks tend to be very small in norm compared to the blocks on the diagonal. So this is just another approximation.

And I'm gonna skip that slide because it's too much math. So this is the high-level picture. This is what natural gradient is about, and this is what second-order methods are about: computing these Hessians and dealing with negative curvature.

But then, when it comes to training models in standard scenarios, it turns out that keeping things cheap is crucial to making them work. So, maybe to repeat something I was saying in the break: all of these methods can be very useful when you have domains in which the loss surface is badly behaved. The example I gave was, say, you're trying to simulate the Navier-Stokes equations or something like that: you have some neural network that is trying to predict the evolution of this system.

Mathematically, if you look at the loss for trying to predict this, it's going to be something that looks very ugly, so that's a place where you want to use these powerful optimizers. But if what you want to do is learn to classify ImageNet,

or do language modeling, which maybe many of you are thinking about, then it turns out that this is not as important. It turns out that the loss actually behaves quite well, and usually the outcome is that keeping the updates cheap is the best way to go about it.

This is what Adam and momentum do. You can think of them, to some degree, as a very crude approximation of all of this business about curvature and whatnot. But they're extremely cheap, and because of that, they dominate; they work well in almost all standard scenarios. There is another hypothesis in the air that people have been talking about for a while, which is:

especially for modern architectures, we've been developing them to work for Adam. So there is this, I don't know what to call it, this theory that what we have done with Transformers and other architectures like ResNets is we basically tuned the architecture to the optimizer, and if we were using a much more powerful optimizer,

we might have found much more interesting architectures. For us, these things are intertwined, right? Whenever I propose a new architecture, I run it with Adam, and if it doesn't work well with Adam, I assume the new architecture is not a good idea, and I'm not even going to publish the paper. That itself is what

maybe drove things to this place where everyone is using Adam. There's this bias in the community, where in general researchers prefer to focus on the architecture, not on the optimizer. So usually they pick the default optimizer and run everything with it. Therefore, when they're developing architectures, they're finding the architectures that have well-behaved curvature, the ones where Adam actually works.

So all of the stuff before: it's heavy math, and it looks scary, and it's interesting. But in the end, in most cases, you're probably never gonna end up using it. You're gonna just end up using the simple thing, and the simple thing is something like RMSProp or Adam.

And here, I mean, maybe momentum should have come first, but here you basically have two terms, right? The first one is the momentum term, where instead of using your gradient, you're using a moving average of your gradient. That does multiple things.

One thing it does is reduce noise. It also shrinks your gradients in directions of high curvature, where the gradient changes direction all the time. But also, with the noise reduced, it becomes a better approximation of the true gradient.

So instead of overfitting on the current mini-batch, you have this moving average over all the previous mini-batches, so it's harder for you to be super biased by the current batch. And the other thing you're doing is computing this moving average of the squared gradients.

And one way of thinking about this is that the squared gradient is a measure of curvature, and this comes from all of this empirical Fisher, natural gradient business and so forth, where you can show that you can always write your Hessian as a sum of outer products of gradients, plus another term,

and that term usually disappears. By the way, I did not say this explicitly, but when you look at this formula, you see there is a Hessian here in the middle. That Hessian, for most losses, is a constant; that's why I ignore it. So this formula, the Gauss-Newton, is really just the outer product of gradients with that Hessian in the middle.

It's there mathematically, but if you write it down for common losses, it turns out to be a constant or something that doesn't matter. So that's the connection: the gradient squared is a proxy for your curvature. And then you take a step that is basically

the momentum divided by the square root of this, and this is what you're gonna use in practice. And you have this epsilon, which is the regularization to deal with zeros. We were just talking about how the Fisher has zero eigenvalues because it's only positive semi-definite; likewise,

this squared-gradient estimate can also have very small numbers, and you don't want to divide by zero. So this epsilon is a regularizer for that; it plays exactly the same role. Usually that epsilon is super small. I know you guys did a course on RL, so one hint that I have is that in RL it turns out to be very useful to make that epsilon very high.

The reason is not that well understood, but the difference between supervised learning and RL is that in RL, if you run Adam, it's actually quite useful to hypertune epsilon and make it larger than normal. But otherwise, this is what it is, and, as I said, the update is written out here.
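
For reference, a minimal sketch of that update (standard Adam with bias correction; the hyperparameter values are the common defaults):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum over gradients plus a moving average of squared gradients."""
    m = beta1 * m + (1 - beta1) * grad        # momentum term (moving average of g)
    v = beta2 * v + (1 - beta2) * grad**2     # curvature proxy (moving average of g^2)
    m_hat = m / (1 - beta1**t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # note the square root and the epsilon
    return theta, m, v
```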

This is the momentum part, and this is kind of the top part of Adam. And really, the intuition of momentum and how it connects to curvature is in this picture, right? If you have a direction of high curvature and a direction of low curvature, you're gonna get this zigzagging behavior.

Now, if you compute the momentum, you'll notice that in the direction where the gradients agree, you're gonna accelerate, and in the direction where the gradients disagree, when you do the moving average, you're gonna squish the magnitude. So you end up with the blue curve, which is a lot more aligned with the curvature. That is, very intuitively, how momentum works.

There are different ways of deriving it. Another way is to take inspiration from physics, and that's why it's also called momentum: if you have a ball and you let it go on a surface, you get a similar kind of effect
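
Here's a toy sketch of that zigzag picture on a made-up ill-conditioned quadratic: plain gradient descent oscillates along the high-curvature axis and crawls along the low-curvature one, while momentum damps the oscillation and accelerates.

```python
import numpy as np

# Quadratic loss 0.5 * (100*x^2 + y^2): high curvature in x, low in y.
curv = np.array([100.0, 1.0])
grad = lambda p: curv * p

def run(momentum, steps=100, lr=0.015):
    p, velocity = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        velocity = momentum * velocity + grad(p)
        p = p - lr * velocity
    return p

print(run(momentum=0.0))  # plain GD: still far from 0 along the low-curvature y direction
print(run(momentum=0.9))  # momentum: much closer to the optimum at (0, 0)
```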

there as well. The last bit that I kind of alluded to, and after this we're gonna take a short break and then go a little bit into the generalization part. So, another thing I've alluded to: so far we've been talking about gradient descent, which means we compute a gradient on the entire data set, for the entire loss.

In practice, that is not possible. So in practice, what we do is estimate the gradient using a mini-batch. And the way this works is we basically need to make the assumption that our data points are IID: they are distributed according to the same distribution, and they're sampled independently.

And if you make that assumption, we know that the expectation over the full distribution is equivalent to, or well approximated by, picking a few samples and averaging over them. And this is the difference: what people say you should expect is that the stochastic gradient looks right, moving in the right direction on average, just with more noise.
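
A minimal sketch of that unbiasedness claim (NumPy, made-up linear-regression data): averaging many mini-batch gradients recovers the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
theta = np.zeros(5)

def grad(idx):
    """Gradient of the mean squared error over the rows in idx."""
    err = X[idx] @ theta - y[idx]
    return X[idx].T @ err / len(idx)

full = grad(np.arange(1000))                              # full-batch gradient
batches = [grad(rng.choice(1000, size=32)) for _ in range(5000)]
print(np.abs(np.mean(batches, axis=0) - full).max())      # ~0: mini-batch gradients are unbiased
print(np.std(batches, axis=0))                            # but each individual one is noisy
```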

It's moving in the right direction, but it has a bit more noise. SGD was mostly introduced as a scheme to make things scalable. In the early days, everyone thought gradient descent was the correct thing to do, but we can't afford it, so let's do stochastic gradient as a proxy for it. In more recent work, and we're going to talk after the break about this,

it turns out that stochastic gradient actually works better than gradient descent. So it's not about computational efficiency anymore; it's the better thing to do, you get better results out of it. And this is just in terms of naming conventions: stochastic gradient means your batch size equals one,

batch gradient descent means you use the entire data set, and mini-batch SGD means you fix a mini-batch size, so you split your data set into groups of whatever, 10, 20, 256, or whatnot. And then, maybe going to this slide, the other very important thing is that, even with all of these measures of curvature and whatnot,

it still turns out that the best thing to do is to adapt the learning rate. The learning rate is not a constant; somehow, because of the approximations we make and all of that, these things will not tell you the right step size, so with a fixed learning rate you will not get the right results.

By far one of the most important things is to have learning rate decay, a learning rate schedule. This is a very old-school learning rate schedule; this is how you used to do it: you start with a constant learning rate, and either you have magic numbers at which you divide the learning rate by something, or you look at the training loss or validation loss, and if it doesn't decrease, then you divide the learning rate by a number.

That's how you used to do it; I used to call these waterfall learning rate schemes. Nowadays, you have a linear warm-up and then some decay, for example exponential, of your learning rate, like the sketch below. This is kind of the standard, and usually what you hypertune is the highest learning rate that you start from.
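
Something like this (a minimal sketch; the hyperparameter values are made up):

```python
def lr_schedule(step, peak_lr=3e-4, warmup_steps=1000, decay_rate=0.999):
    """Linear warm-up to peak_lr, then exponential decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # linear ramp from 0 to peak_lr
    return peak_lr * decay_rate ** (step - warmup_steps)   # exponential decay afterwards

# e.g. inspect it: [lr_schedule(t) for t in range(20000)]
```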

And these schemes tend to work very well, so this is just the standard. I don't have a lot of slides about why and how, but if anyone is interested, I'm happy to talk about that. And I'm gonna stop here for a break, midway, for like 10 minutes, and then we're gonna continue for the last hour to talk a little bit about generalization.

For a long time, this was the main perspective on learning in neural networks: they're non-convex, they have these very complicated loss surfaces, therefore it's impossible to train them to do anything useful, because anything you do, you're gonna get stuck in a local minimum.

And it got to the point, as I was telling you, that in the beginning of the 2000s there were maybe three or four groups that were still doing neural networks, and everyone else was doing SVMs and other things. With SVMs, you have convexity; you have guarantees that things are gonna work out.

So really, what was going on is that everyone was worried about these kinds of issues, and then you had a bunch of people who were like, well, you know, maybe it works. It's, you know, not great, not terrible; let's go with it. And the miracle of what happened is that these people were playing around with things,

and obviously they got some things right as well to make it work. But then they started seeing consistent behavior. You actually have a paper, it's not cited here, I think with Yoshua Bengio, I can dig it up if anyone wants to see it, where they look at this systematically, right?

They sample many random seeds, many starting points, and show that the behavior of the neural network is consistent. And all of this kind of stuff matters because, back in the day, this was a big theme, right? This was a big reason why people did not like neural networks, because they believed

it's all about luck, and there's nothing systematic going on there. So this is where the problem of generalization starts, right? You have this loss; sure, you can optimize it and get somewhere that is better than where you started, but is it anywhere meaningful, and how do you consistently get to somewhere

that is really good?

And where we've landed now, the current myth if you want, is: if you have a neural network and you have some data, you throw the data at the neural network with Adam, and things are going to work out, and you don't need to worry about anything. And if it doesn't work, just make the neural network bigger, and it's going to work.

And that's it. Another way of framing it is that people now view the optimization problem in neural networks as being almost convex. Maybe it's technically not convex, but it behaves as if it were convex. And this is on the back of a bunch of works; these are different plots here from different works

that I'm gonna try to explain. So, historically, the way things happened is that after some changes, including maybe switching to ReLU, some changes to initialization, and some changes to SGD that kind of happened in the background, and then some stuff with the greedy layer-wise pre-training that Geoff and Yoshua were doing, things

started to work consistently. And there were more and more results, you know, like the ImageNet results and so forth, that people could reproduce and get these good numbers with. And then you started getting a series of papers that were trying to check what's going on. These are generally empirical papers.

For example, this paper here, the paper from Ian Goodfellow, Oriol Vinyals, and Andrew Saxe. What they do is they take the starting point, your θ₀, they take the point where you end up at convergence, and they interpolate on that line. So this is a line in parameter space.

They interpolate very finely, and they compute the loss at every point on that line. And they show that the loss is monotonically decreasing. The point is: if I walk on a straight line from where I started to where I converged, there is no wall, there is no weird shape in my loss surface; it's just something that's monotonically decreasing, like in a convex case.
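
The experiment itself is only a few lines; here's a sketch, where `theta0`, `theta_star`, and `loss_fn` are placeholders for your flattened initial parameters, converged parameters, and loss evaluation:

```python
import numpy as np

def interpolation_curve(theta0, theta_star, loss_fn, n_points=50):
    """Loss along the straight line from the initialization to the converged solution."""
    alphas = np.linspace(0.0, 1.0, n_points)
    return alphas, [loss_fn((1 - a) * theta0 + a * theta_star) for a in alphas]

# The paper's observation: for trained networks, this curve tends to decrease
# monotonically along the line, with no bumps or walls.
```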

It's just something that's monotonically decreasing, like in a convex case. Done this on a bunch of examples, and they showed this. And you know? From now, this hypothesis that look things looks almost comebacks. If you want to go on, the more methy heavy side. This is the much methy heavy side, and this is actually a pretty cool paper.

I really like this paper by Yann Dauphin; I'm an author, so I'm biased. There's another paper, from Yann LeCun's group, that says roughly the same thing, the Choromanska et al. paper, and that's the one people usually cite rather than my paper here. But it is a different take. So this comes from

statistical physics. They come up with this theory by looking at, what are those, random Gaussian fields, which I don't know deeply, but there is something called a random Gaussian field that's been studied in statistical physics, and it has this very interesting property: as you expand the dimensionality of the space in which the random Gaussian field lives,

the error that you can reach and the index become strongly correlated. What does that mean, because this sounds a bit heavy. What it means is: the lower the error, the lower the index, and the index is the number of negative eigenvalues. So the lower the error, the fewer negative eigenvalues you have.

So basically, if this correlation is very strong, as is shown in this picture, as the dimension of the space grows, then all the fixed points that have zero negative eigenvalues also have very low error. So let me rephrase this, and this is kind of the theme of where we are now in thinking about the problem.

What that means is that as you blow up the size of the model, all your local minima will basically have the same error, so they're all going to be very similar to your global minimum. What you're gonna get instead is an exponential number of saddles, and that's why we talked about saddles before.

So the intuition right now is that the only thing you need to worry about is saddle points. You don't need to worry about local minima. [Student: yeah, I don't think I get the intuition. Why do all my local minima become similar, why does the gap between them and the global minimum become small?

Why do they all kind of go to the bottom?] Yeah, so,

okay. So the way this paper works is really by taking these results from physics, where there are proofs of this for that particular kind of object, and then trying to connect these random Gaussian fields to simpler random models, and those models to neural networks. That way, they say, oh, these have to behave the same.

And then they empirically compute this kind of plot that shows the same kind of strong correlation. The intuition of why this should happen is a little bit hard to state. The usual framing, and this is something Yann LeCun used to like to say, is that if you're in a high-dimensional space, the probability that all directions point up is very low, so as you increase the dimension, you're always going to find a way of escaping anything that is bad.

That's the way he frames it, but really, I think what is happening is

something weird happening with distances, where things become closer to each other, or more equally distant from each other, and somehow that helps you navigate the space a bit better. It is not clear to me what is going on either, and that's why I call this a myth, because we don't necessarily have any kind of

grounding for why this has to happen. All I can say is that it has been observed a lot, and people started connecting the dots and saying: really, it seems to hold in the standard training regime. And this is another observation:

if you stop during training and you evaluate your Hessian and check whether it has negative eigenvalues, you'll find that it always has negative eigenvalues. And the thing is, if it has negative eigenvalues, it means there is a way to escape, to go lower, because it means that, at best, you're at a saddle.

These are just things that people have observed in practice, and this is the background we're in right now. That doesn't mean the loss surface is always well behaved, though, and here are just a few examples of how things can go wrong, which I'm going to try to explain.

So in this first paper, which is more of a joke paper, but it's pretty funny, what it's trying to say is that if you give me an image, say this Christmas tree image, I can find a place on the loss surface where the loss looks exactly like that image. So there is a subspace of the loss that has that shape.

The point of this paper is to say that, if you want, in some subspace, the loss surface is arbitrarily ugly; the only catch is that it's arbitrarily ugly far away from zero, far from where you normally operate. So, I mean, there are some layers to this. Maybe let me put it this way:

deep learning, the way it exists, and the reason it exists, is that everyone who's doing these experiments follows the pipeline. There is a protocol of how you train these models, right? There is a scheme to initialize them, there is a way to do the gradient descent, and so forth.

So the point here is that if you go far away from the standard initialization, you get into trouble. And one way to get into trouble is this kind of funky theoretical work that says, look, you can find any shape you want in the surface. Or, maybe even more trivial:

you can define your training set in such a way that all your ReLUs, if this is a ReLU model, which is what this paper does, are dead: by playing with how you pick your data points, you can find a data set such that, for one layer, all the ReLUs output zero, and then there's no learning happening.

And that's a bad local minimum, sitting right at your random initialization.

There's also this result here, which is that there is a way of initializing the model to make it have zero training error but be at chance on your validation set. So you can completely break the ability of the model to generalize just by playing with the initialization. But the core idea, the intuition I'm trying to convey, is that things look almost convex empirically, things look well behaved, and there are theoretical reasons, a bit hand-wavy in the sense that the very strong mathematical results are in statistical physics and the connection

to neural networks, which people have been leaning on, is looser. But all of these things look good as long as you do things properly. And by properly, I mean you have proper initialization, you use a standard optimizer, you have normalized your data, and your data,

you know, has nothing pathological in it. So as long as you're following the protocol and doing normal learning, things look good; as soon as you move far away, things look bad. Then there is this funky paper, just for people who like this kind of thing. This is from

Mikhail Belkin, who also did double descent. It's something that, to me, is kind of surprising; in the end it doesn't mean that much, but it's surprising. So the trick is: in a ReLU network, I can multiply one layer by alpha and the other layer by one over alpha, and I have the same function.

And that's just because, if alpha is positive, it's not going to change the sign, so the ReLU doesn't care. So really I have W1 times alpha, times one over alpha, times W2, and that becomes just W1 times W2.
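
A minimal numerical check of that invariance (NumPy, made-up weights):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)
relu = lambda h: np.maximum(h, 0.0)

alpha = 3.7  # any positive scalar
original = W2 @ relu(W1 @ x)
rescaled = (W2 / alpha) @ relu((alpha * W1) @ x)  # scale one layer up, the next down
print(np.allclose(original, rescaled))  # True: same function, different theta
```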

This means that the set of minima is always curved, because this one-over-alpha relationship traces a curve. So if I have a global minimum, the global minimum is never going to be a single point. It's going to be a region, and that region is always curved, because the region has to follow this one over alpha: if I have a point that is a global minimum or a local minimum,

I can construct another minimum by just multiplying one layer by alpha and the layer below by one over alpha, so I can trace out this curve. What that means, and this is the punch line of the paper, is that you can take a minimum and zoom in, and no matter how much you zoom in, it is always going to be curved, and it's always going to have negative curvature.

The standard assumption people make is that things are locally convex, and therefore I can use gradient descent and all this analysis; this shows that's never literally true. So basically, all of these slides, the details don't matter, but what they're trying to say is that the loss surfaces of neural networks are quite complex as mathematical objects,

and there are papers trying to highlight this. But at the same time, as long as we use them following the standard recipe, they are extremely well behaved, to the point where people don't worry about them anymore. So we'll now try to go into some more reasoning about why things are well behaved, and into the usual reasoning about where this generalization power comes from, not necessarily connected to those early works, but something that's a bit more hands-on and makes a bit more sense.

And to do that, I'm going to start from the standard point where people start when they talk about this. Usually the way you would introduce this is: you look at a plot like this, and you say, okay, I have some data points, and I'm trying to fit something to them.

I try to fit a line, it looks something like this. I try to fit a quadratic, it maybe looks something like this. And then I try to fit a ninth-order polynomial, and it looks something like this. And if I'm just looking at my training error, my approximation error, the ninth-order polynomial wins.

But if I'm looking at these plots, which one seems more reasonable for the data that we have? Probably the quadratic, right? So this is the classic overfitting/underfitting picture. There is a sense in which just driving training error down is not good, right?

There is a point where you're not doing the right thing anymore. And what we care about in neural networks, and the way we find the right time to stop, so as not to end up in the ninth-order fit, is usually the usual train/test split.

So here, what I'm trying to say is that what we really care about, at the moment, is being able to generalize in-domain, more or less. Because we're talking about in-domain, the way we ensure we're not losing that property is to rely on statistical learning theory, which basically means we assume there is some distribution π and we've sampled data from that distribution in an IID fashion.

So every time we want to compute this integral, the loss that we really care about, the only thing we need is some unbiased samples so that we can estimate the loss outside the training data. And this is how things go in practice:

you have a bunch of samples, call them the training set, these are the ones we're allowed to train on, and then we have some held-out examples, call them the validation set. And if the validation loss

starts increasing, it means things are bad. And this underpins basically all of machine learning. I mean, it's not a deep thing, but it's the standard thing, right? Anything you do, when you're dealing with data, you always have a training set, a validation set, maybe a test set, you know.

You can do other things, but this is the standard recipe: you always train on the training set and use the validation set to see where you are. This is not the only choice, as I said. And this choice on its own is problematic; in particular, it's becoming problematic

for LLMs, where people now have a really hard time dealing with it, because what is really hard to know there is whether your test set is included in your training set. It's a real problem: with the size of the data that we have, the way the data is being collected, and our limited ability to understand the data,

this concept of IID sampling, you know, sampling independently from the same distribution, kind of breaks down. And actually, I think it's one of the big problems the community has in this space, and I'm not sure how many people are actually thinking about it, but it's really hurting, to the point where

it is becoming harder and harder to know whether a model is actually better than the previous one. You can get better metrics on different dimensions, but somehow the model is still not better, because you're not measuring the right thing. And this comes from this

distributional assumption that is hard to maintain. So there's another choice, and I think I had a slide for it, I'm not gonna go too deep into it, but there's another choice that I think sounds kind of interesting. I think it has problems as well, but this choice is given by the minimum description length principle,

Solomonoff induction, and other funky things like that. Basically, all it's saying is that the model that compresses the data best is the model that will generalize. That's the principle. The idea there is, you look at how many bits you need to store the model

and how many bits you need to store the data given the model. So you look at these two terms: this is how many bits the model takes, and this is how many bits the data takes given that model. The thing that minimizes this total is the thing that is going to generalize better. And the nice thing about this, if you take the prequential approach,

is that you don't need a distribution. So maybe let me explain how this works, looking just at the second term. You take a single data point and you fit the best model you can to that data point. Then you take two data points and you fit the best model you can to the two data points, and you keep going: for every prefix of the data, you always fit the best model you can. This will generate a curve, which is almost like a training curve, but not really, because for every subset of the data you fit the best model you can.

And then you compute the area under this curve.
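
A minimal sketch of that prequential computation, where `fit_best_model` and `log_loss` are placeholders for your training routine and per-point code length:

```python
def prequential_code_length(data, fit_best_model, log_loss):
    """Area under the 'best model so far' curve: the cost of encoding each point
    given a model fit only on the points that came before it."""
    total_bits = 0.0
    for t in range(1, len(data)):
        model = fit_best_model(data[:t])          # best model on the first t points
        total_bits += log_loss(model, data[t])    # bits to encode the next point
    return total_bits  # lower total => better compression => expected to generalize better
```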

The thing that has the lowest area under the curve is the thing that's going to generalize better. There is another intuitive way this has been framed, which is a lot more vague, but maybe useful, and I think it's in the paper from Yoshua's group, which says: basically, the thing that learns the fastest

is the one that's going to generalize better, because it has the right inductive biases. So basically, the concept here is that you don't look just at the loss you get on the training set at the end, but at how quickly you get there.

And if you get there faster, it means that somehow you have the right structure, and you're exploiting the right structure to get there; therefore you're gonna generalize, since you move faster. [Student: is that why a CNN may be more interesting and better, in this view, than a vision Transformer, for example, because it has a better prior,

before some scaling?] Yeah, so in some sense, yes and no; to be fair, this is a bit hand-wavy. In some sense, yes: CNNs do have the right inductive biases when it comes to images, because they assume this translational invariance that the Transformers do not, and therefore you expect the Transformers to need

a lot more data to learn the true structure. It's almost, if you want, and we talked a bit about strong generalization, that CNNs bake in something about images that you know has to be true, which is that locality is important when you process images, and Transformers have to discover this by themselves.

So, in that sense,

convnets will generalize out of domain in a more meaningful way when the task stresses this concept of locality somehow, and if Transformers don't learn this exactly right from the data, they will struggle a bit more. Now,

empirically, I know everyone is switching to ViT, and those tend to work better. I'm not really a vision person, so I don't know exactly the reasoning behind it, but I assume it's a mix of everyone wanting to use Transformers because they're very popular and you have the right libraries,

and also a question of scaling, because certain things scale better with Transformers. And there is one failure mode of convnets that Transformers handle

quite well, which is: if you have two objects that are related to each other but are not close to each other in the image, the only way a convnet can make the connection between them is by going very high in the convolutional hierarchy. When I properly talk about convolutions, I'm gonna show you this.

But basically, the depth controls how much of the image you're actually seeing, the receptive field. So if you want to connect two dots that are far apart, like here, you need to go pretty high in the hierarchy. But the problem is, as you go higher in

the hierarchy, you're throwing away high-frequency content, so you're only looking at a smoothed version of the image, and that makes it really hard to reason. So if you have occlusions, for example, convnets have a really hard time dealing with them, because they struggle to connect objects that are not contiguous, especially when there is something in between them.

Transformers don't have this issue: in a single layer they can jump around the image, because they don't have this locality constraint

that is otherwise useful. You could also argue that locality is not exactly the ground-truth inductive bias either, because you have these occlusion cases where it works against you. But in spirit, that's kind of what this prequential stuff is trying to say. I think you had a question as well, but I'm not sure.

"Yeah, I'm just trying to link the MDL principle, because that's more about efficiency, right, with the idea of generalization. Is the claim that the model is better if it's more efficient?"

So, the MDL principle is an information-theoretic kind of principle: it's something that looks at the data. It's not really about which model converges faster. The true definition of it is exactly the one that I gave, where, independently, for each subset of the data,

you train the best model you can on that data. Okay, so the way this has been done in practice, like this experiment at the bottom, which is from Jörg, who is one person that likes this a lot, is really by looking at the area under the training curve, because you can't afford to train a ResNet a million times,

once for every new data point that you have. So when you "retrain" for a new data point, you just do a single SGD step starting from the previous solution, and that's it. So in that sense (and I'm not sure this even answers your question), the way this principle has been used in practice has been bastardized into "which one trains faster". But really, the principle

is asking: what is the number of data points you need to get low training error? If you can get away with fewer data points, that means you have a better model. That's conceptually what's going on, but it has been reduced to the practical version because that's the bit you can actually compute.

So let me just check what the next slide is. "Oh sorry, yeah, I'm a bit lost between the two concepts. I understand the second one, which we've been discussing now, about the information; how does it relate to faster convergence?" So, there is no theoretical link

as such. This original MDL concept (if I'm not mistaken, the first paper might be from the 60s, but it was developed a lot in the 90s and early 2000s; Marcus Hutter is a big name that liked to work on this quite a bit)

has been translated into actual day-to-day deep learning by converting this exhaustive refitting into something cheaper.

The reason for that is that they argue a proxy for retraining on the new data is to just take the previous solution and take a single SGD step on the new data. But there are also information-based metrics; for example, say I want to evaluate a language model on some text.

Yeah, I can use the perplexity per token, or per character, or whatever. And ideally, if that perplexity is very low, I can say this model generalizes well on this corpus, or language. I mean, the perplexity is essentially the same as the loss.
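Just to pin down that equivalence (this is the standard definition, not something from the slide): for tokens $x_1, \dots, x_N$,

$$
\text{PPL} \;=\; \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\Big),
$$

so perplexity is just the exponential of the average per-token negative log-likelihood, the training loss on a different scale.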

It's really the same likelihood object that we have right here, just on a different scale. So what this is telling you is that you shouldn't just look at perplexity (that would be the usual way of doing things, with a train/validation split), but rather look at

which model needs the least amount of data to get the perplexity down. That's the concept this is going for, as a measure of generalization. And just to connect it to day-to-day practice, the reason I thought it's useful to present this

(and it's a bit of a niche thing; you're not gonna find it in traditional deep learning courses, and people are not going to talk about prequential evaluation) is that there is a reason I put it in the slides:

it doesn't require a distributional assumption, and it seems (I didn't talk about this experiment) that it can figure out spurious correlations. That remains to be seen; it's just one experiment and one claim from one paper. But what is interesting is that in large

language models, empirically, at places like Google and Facebook and whatnot, people are actually using this thing without knowing it. What's happening nowadays, because of the scale and the computational cost (and I can tell you this from my own experience), is that when they do the pre-training stage, they do not have an explicit validation set where they look at the perplexity. They

select models based on, in their mind, the training perplexity. But if you look at the code and what they're doing in practice (again, this is for compute-saving, engineering kinds of reasons), it's this: because you're doing backpropagation, you first do the forward pass and get your perplexity, and then you do your backward pass, compute the gradient, and apply the update.

So basically, what they do in practice is they first evaluate on the new data point and then take the step on that data point, which in some sense is very close to what this is saying. And the other thing they do, because a single point is noisy, is a moving average, which roughly corresponds to looking at the area under the curve.
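Here is a minimal runnable sketch of that pattern on a toy streaming linear-regression problem (the model, data, and constants are all mine for illustration; real pre-training loops are far more involved but follow the same evaluate-then-update order):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                             # toy "model": linear-regression weights
ema_loss, EMA_DECAY, LR = None, 0.99, 0.05  # illustrative constants

for step in range(1000):  # a stream of fresh batches, never revisited
    X = rng.normal(size=(32, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=32)

    # Forward pass: loss on data the model has NOT been updated on yet,
    # so this is an unbiased generalization signal (the prequential idea).
    err = X @ w - y
    loss = float(np.mean(err**2))

    # Moving average of that loss ~ integrating the area under the curve.
    ema_loss = loss if ema_loss is None else EMA_DECAY * ema_loss + (1 - EMA_DECAY) * loss

    # Only now take the gradient step on the same batch.
    w -= LR * (2 / len(y)) * X.T @ err
```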

So this is usually the metric that big groups end up using in practice when they have these big models that take most of the compute they have, and also most of the data they have. It is

not exactly done correctly, but it's close enough in spirit. This is sort of a claim that Jörg and Marcus Hutter, people I know and work with at DeepMind, would make. Obviously, they have a stake in the fight, because prequential evaluation is the thing they came up with, but they're trying to argue that people are actually using it day-to-day.

To expand on the point: in large language models, when they have to train these big models, there are usually two stages in LLMs, the pre-training stage and then the post-training stage. Again, I don't like these names, but anyway. The pre-training stage is really:

you have lots of data, you train the model from scratch, and the only thing you're looking at is perplexity. Post-training is usually when you start doing safety training, trying to remove harmful language; you do RLHF, you do instruction tuning to make the system follow instructions, and all of that.

These are all fancy names; what post-training really is, is a bunch of stages of fine-tuning on very dedicated data sets, or with very dedicated objectives. And it's in post-training that they have evaluations that are different from perplexity, right? They would use different kinds of ways of rating the system.

They have Q&A kinds of things where they look at how well it answers questions, they have human evaluations, and all of that stuff usually happens in post-training. But pre-training is the bit that takes most of the time; something like 99.9 percent of the compute and energy goes into pre-training.

And it's really just perplexity-driven: you have lots and lots of data, and the model just predicts the next token.

The compute that you need is so big that people usually do not do the standard thing. The standard way is: you run your training job, you ask your training code to save the weights regularly, and you have another process that you spin up on the cluster that loads up that checkpoint and evaluates it on some held-out set, and so you get some sense of how your validation error evolves.

But what people have started doing recently, I mean maybe in the last three or four years, is they don't run this validation job anymore, because it's too expensive anyway, and everything is expensive. What they do is actually look at the training error. That's sort of their claim: it's sufficient for me to look at the training error, and that will tell me when to stop,

whether my model is doing well, and whether everything is fine. I can use the training error instead of the validation error.

Now, that wouldn't be valid for the usual training error, but there are two things they do differently that make it not the usual training error. One is the way they compute it: they exploit the fact that if you want to compute a gradient, you first have to go forward to get a loss so that you can backpropagate, because that's how backprop works.

Therefore, they first evaluate on the data point and then take the step. So the loss is not measured on data you've already trained on; it's on the data you're about to train on. You compute the loss before you train on it, so in that sense you're not biased, right?

You first use the data to evaluate, and then you use it to train on. The second thing is the moving average, and the reason for that is that otherwise things are too noisy. But when you do the moving average, it's almost like you're integrating the area under the curve.

So it's not exactly the same thing, but it's connected to it. "Yes, so why isn't everyone using this idea of evaluating on the data point, doing the gradient step on it, and then updating my weights? Why is everyone"

"not using it, even when I'm training a very small CNN and I'm losing like 20 percent of my data for validation?" Yeah, I mean, okay, so there's no good theoretical grounding for this. If you talk with the people that work in this space, they would say it works because we never have enough compute to overfit the LLM anyway, so it's fine,

and that's why we can do it. But on small models, maybe you sometimes even do it without realizing; the point is that for smaller models, people just look at the validation error. They don't really look at the training error (I mean, you look at the training error for pathological behaviors), but mostly you rely on the validation error.

It's easy to compute; most of the scripts you find online will automatically compute the validation error, so it's just a matter of practice. I think what Jörg and Marcus Hutter would like is for everyone to do prequential evaluation and for no one to use the validation set.

But the truth is, the principle itself,

first of all, really says you need to train the best model you can for every subset of the data, which is not what we're doing, and there's no strong theory to say that the proxy we're using is good enough. So anyway, this is also new stuff.

It's one of those things where maybe at some point in the future we'll end up doing this. I kind of doubt it, because I think there is a big gap between the MDL principle and the sort of trick that we're doing in practice, and that gap might break things quite a bit.

Still, it's kind of interesting to think about it this way. There is also one more thing I should mention, because I find it a bit interesting. This also sounds a little bit like black magic: if you listen to Marcus Hutter, he really makes it sound like this is the answer to all problems. You don't need a distribution,

you don't need anything. But there is a catch in MDL as well, and the catch (this is according to me) is that the area under the curve that you are measuring is only meaningful after you've seen sufficient data, and what "sufficient data" means,

no one knows. So say I have a process generating my data that behaves like a linear function for a million steps and then switches to a different function. Until you get to the switching point, a linear model looks like the best way

to model your data, and then when you evaluate it as you go further, you'll see that it doesn't work anymore, because at some point the process changes. So really, there is this concept that, sure, the thing that compresses the most is the closest to the true program, because that's the most compressed form of it, but only if you've seen enough of the data generated by that process that it somehow covers the entire program. And this notion of "enough" is not

well defined in the MDL literature, right? In the MDL literature they say, well, you can do this trick, and if you do it, it's fine. But I think there is a gotcha there, where you can always reach the wrong decision if you somehow do not go far enough in how much data you see.

"But at the scale you said this is implemented in practice, that isn't really an issue." Yeah, for the LLM stuff that's not an issue; I'm just saying, if you think of this as the right way to do model selection, say for smaller things, it's not necessarily clear to people that there is a point below which you can't really trust the system, unless you've interacted with it enough.

And if you read his papers, he never talks about that. He only says: I don't need a distribution, this works, and it's sufficient to look at the area under the curve; that's the only thing you need. So, how much time do we have? A quarter of an hour, 15 minutes. So I'll try to push through a few more slides, because I was hoping to be further along in my deck.

But yeah, okay. So this whole prequential learning thing was a side note. If we go back to more traditional learning theory (and this is maybe something you already know), this is the picture you'll find in a textbook and the kind of thing people will tell you: you usually have these three regimes when you're playing with model capacity. One is underfitting:

you have a small model you're trying to train on the data, and in this scenario the validation loss, the test loss, and the train loss track each other and keep going down. Then there's the regime where the model is just right: it has the right capacity, and that's when the test and train losses are at their lowest.

And then, as you keep training, or as you make the model bigger, the training loss will keep going down, but the test loss is going to go up, and this is when you start fitting the noise. So this is exactly this picture, right? This is where you have a good fit, and if you keep increasing capacity you're going to overfit:

you're going to drive the training error even lower, but at the expense of the validation error. And how does this play out for neural networks? The way people think of it is that capacity is the size of the model, right? Of course you have the linear model versus maybe a quadratic model, but once you move to the neural-network space,

all neural networks are, you know, universal approximators, but their expressivity, their capacity, is determined by the size of the weights: how many weights you have, how many neurons you have. So the usual intuition is also that you need to pick the right-sized model for your problem,

one that's not too big. "Does this relate to double descent?" Yes, this is exactly related to double descent. This is sort of what an old-school introduction to generalization will tell you, but in practice we know this is incorrect: we know that the bigger the model, the better,

and if you keep making it bigger, it's even better. This is where double descent comes in. Oh, but I had a different slide before that. So before I talk about double descent, I just want to say that you can control capacity through the model size, but you can also control it through the number of training steps.

The way to understand that intuitively is: even if you have a neural network that's infinitely wide, a universal approximator, as expressive as you want, if you limit the number of SGD steps that you take, then you limit the set of functions you can reach, right?

Because you can't traverse the whole parameter space, you can only travel so far, so that's another way of limiting capacity. Limiting the number of steps plays the same role as limiting the number of parameters: you can make the model very big and limit the number of steps, or allow as many steps as you like and limit the size.

Yeah, I'm kind of swapping between these views, and maybe that's a bit confusing. Another way of controlling capacity is through regularization (sorry, I'm gonna get to double descent after this). Regularization is really this: if you have a loss that has multiple local minima, or multiple minima in general,

and you have a regularizer that has only one minimum, say an L2 penalty, which has its minimum at zero, then when you sum these two surfaces together, you make the different minima take different values, because the regularizer will favor the minimum that's closer to zero and assign a higher value to the ones farther away.

So that's one way of thinking about how the regularizer resolves the ambiguity: by prioritizing solutions that are closer to zero. I'm gonna skip over this slide because it's a lot of math. It's just really showing, and we talked about this before,

that if you're being properly Bayesian, you basically get a regularization term, which is your prior. I think we talked about this: usually, if you're being properly Bayesian, when you try to optimize things you get that your objective is the negative log-likelihood, which is how well you fit the data,

plus how well you respect your prior. And if you pick your prior to be a Gaussian centered at zero, that ends up being an L2 penalty, which basically says: I prefer solutions that have small norms. So this is just a probabilistic way of getting to the regularization term, and it's a natural outcome of having a prior.
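As a quick sketch of that derivation (standard MAP reasoning; the Gaussian prior variance $\sigma^2$ is the assumption):

$$
\hat\theta \;=\; \arg\min_\theta \; \underbrace{-\log p(D \mid \theta)}_{\text{data fit}} \;+\; \underbrace{\frac{1}{2\sigma^2}\,\lVert\theta\rVert^2}_{\text{L2 term from } p(\theta)=\mathcal{N}(0,\,\sigma^2 I)} \;+\; \text{const},
$$

so the L2 coefficient is just the inverse (scaled) variance of the prior: a tighter prior around zero means stronger regularization.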

Yeah, maybe there's not that much more to say here. I'll come back to this when I actually talk about convnets, but just to tell you, we have many techniques to regularize, and all of them constrain the problem. So here, with data augmentation, the model now has to learn that both these shifted versions of the image are a cat, so it has to learn more.

So you use more of the capacity. Okay, there are quite a few slides before the double descent one. Do I have time to go through that many slides, or should I jump to double descent? Let me just try to go through all of them; double descent should be right after this.

So, we're on the regularization point, and there is another observation. These are the more traditional ways, or at least, sorry, this one is a super traditional way of regularizing models; this is the textbook, I don't know, year-2000 way of regularizing a model. There are other, funkier ways of regularizing a model.

One idea that was thrown out by Hochreiter and Schmidhuber back in 1997 is this idea of flat versus sharp minima. There is this effect that people found, and it was very surprising: for a long time we were doing mini-batch SGD because we couldn't afford to do anything else, and at some point GPUs caught up

and became extremely powerful, so people were like, let's just do full-batch GD, because now we can, and that's the "right" thing to do. And then, when you do full GD, it turns out that things are actually worse. There was this big debate about what's going on, like, why is SGD helpful?

And one of the hypotheses that has kind of stuck around is that the noise is helpful because it allows you to escape sharp minima. The intuition here is that you have a loss like this, where you have a minimum that's very narrow and then a flat minimum; because of the noise in

SGD, you cannot converge in the narrow one. The noise will push you out, so the only minima you can converge to are the ones that are wide enough, much wider than the variance of your noise.

There is also a more MDL-like principle for why this is useful: the argument is that flat minima are much more compressible, like you need fewer bits to describe them. But another way to think about it (I don't have a slide on this) is: if this is my training loss, you can imagine that the test loss is basically a noisy version of it, shifted a little bit and so on. And the thing is, for the test loss,

if you shift a little bit at the narrow minimum, your loss is already going to be very high; but at the flat one, if you add noise, the loss stays roughly the same. So that's another intuitive way of understanding why flat minima generalize, and the thing that finds them

here is stochastic gradient descent.
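That shift intuition can be made slightly more formal with a second-order expansion (a standard argument, in my notation rather than the slide's): for a small perturbation $\delta \sim \mathcal{N}(0, \sigma^2 I)$ around a minimum $\theta^*$,

$$
\mathbb{E}_{\delta}\big[L(\theta^* + \delta)\big] \;\approx\; L(\theta^*) + \frac{\sigma^2}{2}\,\mathrm{tr}\,H(\theta^*),
$$

where $H$ is the Hessian of the loss. A sharp minimum has a large Hessian trace, so any mismatch between the train and test surfaces costs a lot; a flat minimum has a small trace and is robust to the shift.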

"Just a quick question, sorry, if you have time. Why do most people not exploit this? They focus on regularization, on learning-rate schedules, or whatever, but we don't see something like a batch-size schedule, where for example I start with a certain batch size to shape the noise."

"Exactly, yeah: why, in general, don't people mess around with the batch size during training, instead of just selecting a fixed value like 64?" Yeah, so I don't necessarily have a good answer. My expectation is that the engineering work you'd have to do doesn't buy you enough; it's basically not worth it for the improvement you'd see. Because, I mean, depending on your setup,

if you're playing with CIFAR or whatnot, there's probably no engineering work, but if you're playing with a large data set, you'd have to serve it in different ways throughout training, change how you're sampling, and things like that. There is another reason, maybe correlated to that, which is that if you change the batch size,

you need to change the learning rate. There is a formula going around for what the correction to the learning rate should be if you change the batch size, but that formula is not ground truth. So you can't really stop midway and retune your learning rate just because you changed your batch size.
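For reference, the formula being alluded to is most likely the linear scaling heuristic (popularized around Goyal et al., 2017, and known to break down for large batches, which matches the "not ground truth" caveat; a square-root variant is also in circulation):

```python
def rescale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    # Linear scaling rule of thumb: keep lr / batch_size roughly constant.
    # E.g., rescale_lr(0.1, 256, 1024) -> 0.4. Purely a heuristic, not exact.
    return base_lr * new_batch / base_batch
```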

And then, if you're doing Adam or things like that, there's more: I actually haven't even mentioned this, but in Adam you have at least two extra hyperparameters, beta1 and beta2. These are the decay rates of the moving average of your momentum and of the moving average of your squared gradients, and those might depend on the batch size as well, so maybe you'd need to retune them too.

So there are all these side questions that people don't want to deal with, and it's always just easier to have a fixed batch size and tune around it. There is, though, a whole line of work

called sharpness-aware minimization (SAM), which is a take on SGD whose main goal is to find flat minima. So there are algorithms (and SAM was used quite a lot at some point), including variations of Adam and so on, that are explicitly framed around this and trying to exploit it.
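A minimal runnable sketch of the SAM update on a toy quadratic loss (following the two-step scheme of Foret et al.; the loss and constants are mine for illustration):

```python
import numpy as np

def loss_grad(theta):
    # Gradient of the toy loss ||theta||^2; stands in for a minibatch gradient.
    return 2.0 * theta

theta = np.array([2.0, -1.0])
LR, RHO = 0.1, 0.05  # step size and perturbation radius (illustrative values)

for _ in range(100):
    g = loss_grad(theta)
    # Step 1: move to the (approximate) worst point within radius rho.
    eps = RHO * g / (np.linalg.norm(g) + 1e-12)
    # Step 2: descend using the gradient taken at that perturbed point,
    # which penalizes sharp minima, where the loss rises quickly nearby.
    theta -= LR * loss_grad(theta + eps)
```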

How much time do we have? Five minutes. I'm not gonna get to double descent, so I'll leave that for tomorrow; I'm just gonna continue with this, because it's actually a cool thing. This is by far

the most intriguing hypothesis; it seems to hold in a lot of settings, and it's one of the things that helps explain a lot of what is going on. So I think if there is any take-home message here, it's that understanding the sharp-versus-flat-minima behavior

is kind of important, because it underlies a lot of things. There is another framing of this, and I'm not going to go into the math of it, but roughly, they did this thing where they look at the discrete updates that you're doing, and they work backwards from that

to find a loss such that, if you follow the gradient flow of that loss, you get exactly the behavior of your updates. Okay, the details don't matter. The whole point is that they managed to identify an alternative loss that

your normal updates are effectively minimizing, and that alternative loss is your original loss plus a regularization term on the norm of the gradients. And this regularization term on the norm of the gradients is basically trying to do the same thing: it's saying that the minima you find have to be flat. So this idea comes up in many places.
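If I recall this line of work correctly (the backward-error-analysis result of Barrett and Dherin, extended to SGD by Smith et al.; treat the exact constants as approximate), the modified losses look like this, with step size $h$ and minibatch losses $L_k$:

$$
\tilde L_{\mathrm{GD}}(\theta) \;=\; L(\theta) + \frac{h}{4}\,\lVert \nabla L(\theta) \rVert^2,
\qquad
\tilde L_{\mathrm{SGD}}(\theta) \;\approx\; L(\theta) + \frac{h}{4}\cdot\frac{1}{m}\sum_{k=1}^{m} \lVert \nabla L_k(\theta) \rVert^2 .
$$

The difference between the two penalties is exactly the point made next: the full-batch gradient can be small even when individual minibatch gradients are large and cancel out.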

And the interesting thing about this line of work is that for GD you get one form, and for SGD you actually get a different form. That's interesting because now you can write down the implicit regularizer that GD imposes and the implicit regularizer that SGD imposes, and because the regularizers are different, you can sort of understand where the difference in behavior is coming from.

So the difference (sorry, I'm kind of skipping over this quickly, but I'm gonna stop after this) is: for GD you just penalize the norm of the full gradient over the entire data set, while for SGD you penalize the gradient norm on each of the mini-batches. And if you think about the parameter space, say you have some directions that cancel out when averaging into the full gradient: then GD is not going to impose flatness in those directions,

while with SGD you're going to impose flatness more widely, all around you. So that's why SGD works better: it finds minima that are flatter in all directions. And the last part of the slide is that, while flatter is better, there are also some open questions. One example: from this hypothesis you would assume that adding noise is useful because it helps you find flatter minima.

It turns out that only certain types of noise are useful; for example, Gaussian noise is not going to help you, the optimization just degrades, which is very frustrating. So there is definitely something about noise, but it's specifically about the kind of noise you get from stochastically sampling data points.

For example, here we tried to use the noise coming from data augmentation, and that hurts: if you get rid of that noise, you actually get better performance. "Does this relate to what you mentioned about deep learning models performing better with natural data?" Yeah, so here we're basically trying to see,

and I mean, this is just one facet of that, whether the noise from standard image data augmentation is as good as the noise coming from sub-sampling the data at forcing you to generalize better, and the answer is no.

It's not just about landing in some flat minimum; the nature of the noise that comes from stochastically sampled data makes things flatter in the right way, and what exactly that means, we don't know. But it's not the case that any noise you add on top of your SGD updates will help you.

Okay, with this, I think I'll have to stop here, because otherwise I'm gonna run over, and I think that's been quite a bit already. Let me see. Yeah, okay, there are quite a few more slides, so we'll catch up on those

tomorrow morning, I think. Thank you, thank you.