And following the literature, there is this thing about finding preconditioning matrices, right? So you pick a preconditioner P, and it makes your learning faster, and there are some choices of P that make things go much faster. In particular, the one we're gonna talk about

is natural gradient, which, in the end, Adam is very related to. So this natural gradient, I don't even have the citations here, but the natural gradient algorithm comes from Amari. It's a very interesting paper if you're ever bored and want to read a paper.

So, Amari started working on natural gradients in the 60s. I think the natural gradient paper for deep networks is from the 80s or 90s. It's actually a hard paper to read. Amari, Shun'ichi Amari, is a Japanese researcher, and until recently, I would say the Japanese community was kind of isolated from the rest of the machine learning community, the US and European community, and you can see that in the paper. It was the most painful paper that I ever read.

Because the notations are completely different. He uses Einstein summation, which I wasn't used to. He calls the biases a instead of b, and the weights are called, I forgot what; no notation makes sense. Nothing of the standard notation holds, and the way he uses the activation functions is weird. But the paper basically introduces this concept, and at a high level the intuition is: we're doing the usual thing, where we take a linearization, a first-order Taylor expansion of our loss, and we want to minimize it, and we want to create some trust region, some constraint, so that we don't move too fast, right?

So that the approximation holds. But we want that constraint to be in function space. So his point is, and this is something that's easy to notice, that there are many values of theta that give you the same function, so your model is not parameterized in a unique way.

So, for example, one thing you can think of is you take two neurons and you swap them around. Technically, you change nothing, because if you swap the incoming weights and you swap the outgoing weights, you get exactly the same function. But if you look at the weight matrices, the weight matrices have swapped columns, so it's technically a different theta, right?

So you've landed at two different points in parameter space. His point is that the mapping between parameters and functions is not necessarily one-to-one, and even if it is one-to-one, sometimes a change in theta creates a big change in your functional behavior.

Sometimes it doesn't, right? So there is a metric there that says that functions don't change the same way parameters change. So all he wants to do is, instead of saying, I want to take a step that is small enough that my theta doesn't change by more than epsilon,

he says, I want to take a step small enough that my function doesn't change by more than epsilon. That's the big idea. So then the next question is: okay, if I want a distance between functions, how do I define a distance between functions? And that's a very hard question on its own. So the choice that Amari made was saying, well,

we know in neural networks that the output can always be interpreted as a probability, and it goes back to all of those things about probabilities, so we can always think of the output of a neural network as corresponding to some kind of distribution. And we know how to compute distances between distributions.

And this is what we're gonna do. So p(z | θ) is basically the distribution that comes out of your model, and we can use the KL, which is a divergence. It's not a distance, but it's good enough and everyone uses it. So I'm going to use the KL to measure how much my output distribution changes if I change theta. Yes, question?

[Student: is it a proper distance? What does it mean in this context?] It means that if I compute the KL of p(z | θ) against p(z | θ') and I reverse the order of the two terms, I don't get the same thing. It's not symmetric. That's basically the main thing that's missing for it to be a proper distance.

It's not symmetric, but otherwise, you know, it's zero when the two distributions are equal, and it's non-negative. The triangle inequality, I don't think that one holds either; anyway, there's a bunch of properties that can fail. [Student: so it measures how different they are?] Yeah, yeah, it measures a kind of difference between the distributions.

It's just not symmetric, and that's kind of important. That also means that if I reverse the order here, I'm going to get a different algorithm, because the order matters when you do your expansion. Okay, so this is sort of what he was going for. Let me try to skip ahead a bit.
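
To make the asymmetry concrete, here's a minimal sketch (plain NumPy, with made-up distributions) of the KL between two categorical distributions, computed in both directions:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) between two categorical distributions (arrays that sum to 1)."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q), kl(q, p))  # two different numbers: the KL is not symmetric
print(kl(p, p))            # 0.0: it is zero when the distributions are equal
```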

So now the question is: okay, we've done the first step, we've decided how to make our trust region. But obviously this term is nasty, right? I have this KL term and I don't know what to do with it. So the next step in this process is, okay,

I want to replace this constraint with a constraint that's a lot more pragmatic, that I can actually use. So the way he does this is, he takes a second-order Taylor expansion of the KL and then plugs in that second-order expansion instead of the KL itself.

And the reason for that is that once you take the second-order expansion of the KL, it turns out that it simplifies a lot. So if you write it down, and again, I don't want to go through all the math, I'm just going to say a few words and you can look over it whenever you want:

if you write the formula of the KL, here I wrote it as the sum over z of p(z | θ) times the log of the ratio, so log p(z | θ) minus log p(z | θ + Δθ), and then you convert this into an expectation because it's easier.

And then you start doing the Taylor expansion around θ of this term, because that's the one that contains the Δθ. What happens, and you just have to trust me, is that the zeroth-order term in that expansion disappears, because you get log p(z | θ) minus log p(z | θ), so that's zero.

So it cancels out. And then the first-order term of the Taylor expansion, the one that's just in terms of the gradients, disappears as well. The reason there is a bit more technical. Let me see if I have it. Sorry, no, I don't have the math right here.

The reason is that the expectation and the derivative are both linear operations, so you can reverse their order. So what you can do is push the expectation inside, and you take the gradient of the expectation of this first-order term; when you push the expectation inside, you get a one, so you get the derivative of 1 with respect to something, which is zero.

Anyway, you can look this up. But technically, because of the expectation, the first-order term disappears as well, and the only term that is left is the second-order term, which is Δθ transposed, times the second-order derivative of log p(z | θ), times Δθ. And this is kind of nice, because of what comes next.
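
To collect the steps he just described in one place, this is the expansion in symbols (a sketch, assuming the usual regularity conditions that let you swap the derivative and the integral):

```latex
% Second-order expansion of the KL around theta
\mathrm{KL}\big(p(z\mid\theta)\,\|\,p(z\mid\theta+\Delta\theta)\big)
  = \mathbb{E}_{z\sim p(\cdot\mid\theta)}\big[\log p(z\mid\theta)-\log p(z\mid\theta+\Delta\theta)\big]
  \approx \tfrac{1}{2}\,\Delta\theta^{\top} F\,\Delta\theta,
\qquad
F = -\,\mathbb{E}_{z\sim p(\cdot\mid\theta)}\big[\nabla_\theta^2 \log p(z\mid\theta)\big].
% The zeroth-order term cancels; the first-order term vanishes because
\mathbb{E}_{z\sim p(\cdot\mid\theta)}\big[\nabla_\theta \log p(z\mid\theta)\big]
  = \int \nabla_\theta\, p(z\mid\theta)\,dz
  = \nabla_\theta \int p(z\mid\theta)\,dz
  = \nabla_\theta 1 = 0.
```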

So the second-order derivative of log p is going to be your metric. Yeah, so if I go back to the constraint, all of the other terms disappear. And, no, I really don't have the math slide here. But basically, I end up replacing this KL by a formula that looks like the following.

Do I have it anywhere? Sorry, I don't. But you replace this KL by something that looks like Δθ transposed, times the second derivative of log p, times Δθ. So the second-order derivative of log p, which is a Hessian, would be my new matrix.

This will end up being what I'm preconditioning with.

This second-order derivative of a loss, you can always rewrite. Wait, am I taking an expectation here, or am I just... no, I'm not. You can always rewrite this to exploit the layered structure of the architecture. And you can see that, because the loss

is the loss function applied to the output of your model, you can rewrite it as a sum of terms. This is basically just using the chain rule. So you say: the first derivative of the loss, because of the chain rule, is the derivative of the loss with respect to the output, times the derivative of the output with respect to theta.

That's just the chain rule, right? That's just writing those two terms. And now, when I take the derivative of this again to get the second derivative, I have the product rule. So, because of the product rule, the derivative of the product of these two terms becomes this term plus this term, written out below.
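
In symbols, with y the network output and J = ∂y/∂θ its Jacobian, this is the standard decomposition he's pointing at (a sketch; the slide's exact notation may differ):

```latex
\nabla_\theta L = J^{\top}\frac{\partial L}{\partial y}
\quad\Longrightarrow\quad
\nabla_\theta^2 L =
\underbrace{J^{\top}\,\frac{\partial^2 L}{\partial y^2}\,J}_{\text{kept: Gauss--Newton term}}
\;+\;
\underbrace{\sum_i \frac{\partial L}{\partial y_i}\,\nabla_\theta^2\, y_i}_{\text{dropped near convergence}}
```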

And this is kind of a scheme that's used often with second-order derivatives, and the way people usually reason about it is: let's say I'm going to drop this second matrix. And the reason I'm going to drop it is that when I'm close to convergence, this loss term is going to be close to zero, so this whole term is going to be close to zero, so it doesn't matter.

And I'm gonna keep this other matrix, and this matrix is usually called the Gauss-Newton approximation of the Hessian, if everyone has heard the term. The nice thing about this term is that it has this form of Jacobian transposed times Jacobian, and you know that when you have this form, it has to be positive semi-definite.

So this is how people usually get rid of negative eigenvalues. And it turns out that when you do natural gradient... okay, so I know I put these comments here because I was going to derive all of this on the board, but I ditched that in the end. The whole point here is that,

if I'm looking at this second derivative of log p(z | θ), which is the term we take the expectation over, then, without working too much through it, I can say that this can be replaced, in expectation, by the derivative of log p

times the derivative of log p, with a transpose somewhere; I'll put it on the second one, I'm not sure on the board if that's the right place, but basically you can do the expansion and you get these two things. [Student: what about the Δθ?] Sorry, what about the Δθ? So the Δθ will come back in front of this.

For now, I just took this middle term and wanted to expand it. The reason I'm trying to do this is because I want to prove, well, I'll send it to everyone, I'll update the slide with a link. Actually, if you look up this paper that I'm an author of, Revisiting Natural Gradient,

in the paper, you have the step-by-step math in the appendix, if you really want to go through it, the step-by-step of how you go from here to there. But for here, I just want to give you the intuition. So the big change of natural gradient is: you replace this with this functional distance, and at the end of it, when you take the Taylor expansion,

you end up with a Hessian again. So one question you could have is: what have I solved? I had a Hessian before. I did not like it, because it had negative curvature. I went through all of this exercise to change it to this distance and whatnot, and in the end, I got another Hessian.

So was this good for anything? The tricky bit is, you can try to expand that Hessian, and you can exploit the fact that this is an expectation of that Hessian. And it turns out that if you do that, you get that this thing is equivalent to an expectation over the outer products of gradients.

So the advantage of this form is that it is positive semi-definite by construction. It doesn't have to be positive definite, but it's at least positive semi-definite by construction. You can just take my word for it, or you can look at any linear algebra book, but any matrix that has the form A times A transposed, or A transposed

times A, I don't remember which order applies here, is positive semi-definite by construction; you can prove that. [Student: so, to check my understanding, you solved the problem of the Hessian having very large negative eigenvalues exactly using this?] Yes, because this matrix does not have negative eigenvalues.
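
Here's a minimal numerical check of both claims (NumPy, using a categorical distribution parameterized by softmax logits, where everything is available in closed form): the expected outer product of score vectors equals minus the expected Hessian of log p, and it has no negative eigenvalues.

```python
import numpy as np

theta = np.array([1.0, -0.5, 0.3])            # logits
p = np.exp(theta) / np.exp(theta).sum()       # p(z | theta) = softmax(theta)

# Score vectors: grad_theta log p(z) = onehot(z) - p, one row per outcome z.
scores = np.eye(3) - p
fisher = sum(p[z] * np.outer(scores[z], scores[z]) for z in range(3))  # E[g g^T]

# Hessian of log p(z) wrt theta is -(diag(p) - p p^T), independent of z here.
neg_expected_hessian = np.diag(p) - np.outer(p, p)

print(np.allclose(fisher, neg_expected_hessian))  # True: E[g g^T] = -E[Hessian of log p]
print(np.linalg.eigvalsh(fisher))                 # all >= 0 (one is ~0): PSD by construction
```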

So even though, in the end, I end up with a Hessian, this Hessian does not need regularization for negative eigenvalues. So this plot here is trying to illustrate what's going on. The red surface, I should have made it bigger, the red surface is the loss function that you are navigating. What I'm trying to show here is that once I've plugged in that KL and worked it out, I have a different surface.

And that surface is always quadratic; it doesn't have a saddle. And what I'm doing is taking a descent step on the red surface using the curvature of the green one. That's basically what we're doing when we do natural gradient: we're creating this additional function, which is the KL, and we're taking the curvature of the KL and applying it to the function we actually care about. And this makes sense because what we're really doing is solving this constrained optimization problem.

What we're saying is: we're solving our original problem, the first-order approximation of the problem, where we constrain how much the distribution is changing in a KL sense. And by doing that, we end up with this algorithm, which is a preconditioned SGD, so it has this form where the preconditioner ends up being the outer product of gradients.
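
Concretely, a minimal sketch of one such preconditioned step (NumPy; `per_sample_scores` stands for the per-sample gradient vectors, and the small damping term anticipates the zero-eigenvalue issue he gets to in a minute):

```python
import numpy as np

def natural_gradient_step(theta, loss_grad, per_sample_scores, lr=0.1, damping=1e-4):
    """theta <- theta - lr * (F + damping*I)^-1 @ loss_grad,
    with F estimated as the average outer product of per-sample score vectors."""
    n = per_sample_scores.shape[0]
    F = per_sample_scores.T @ per_sample_scores / n        # outer-product preconditioner
    precond_grad = np.linalg.solve(F + damping * np.eye(len(theta)), loss_grad)
    return theta - lr * precond_grad
```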

So then it doesn't have any negative curvature. That is the big thing about natural gradient. Yeah, so I'm not going to work this out on the board, because I feel like we've seen too much math anyway and I'm worried it's just gonna look like numbers. But the additional thing about it is: you have this particular derivation of natural gradient, which says, if I constrain the KL and I do a bunch of math, I end up with this preconditioner

that is just the outer product of gradients. And then you have this other derivation, which is not natural gradient; this is traditional second-order methods, where people use this Gauss-Newton approximation. So this is a different derivation, which says: the Hessian of the loss that I care about is equal to the

outer product of the Jacobians from the output to the parameters, times the Hessian of the loss, plus the gradient of the loss times the Hessian of the outputs. This is sort of a big thing in the optimization world; this is called the Gauss-Newton. And the reason I wanted to bring

this up, I mean, there are many reasons I put this slide in, but maybe one reason is because it's kind of interesting, to add a bit of history and make it a bit more fun, how these things evolved. So, Amari did natural gradient early on,

but then people weren't actually using it. So for a long time, people were kind of ignoring it. Then there is a paper from Yoshua's group that claims to do natural gradient but is actually not doing natural gradient. In that paper, they messed up the preconditioner and used a different matrix.

And then, in parallel, you have folks like James Martens who were using this Gauss-Newton approximation, and they were claiming that Gauss-Newton works better than natural gradient because it is connected to the Hessian and whatnot. And then, later on, there is a paper that actually shows that Gauss-Newton and natural gradient are the same thing, because it turns out that this matrix is exactly the second-order derivative of the KL.

And this is always true if you have a matching loss and activation function pair. So if you pick a reasonable loss and activation function, which you typically do even if you don't think about it, like negative log-likelihood with a softmax, or mean squared error with a linear output, then this Gauss-Newton approximation is exactly the same as Amari's Fisher.

And all of this matters because it tells you, besides the motivation that we had here, why this makes sense. This makes sense because this turns out to be an approximation of the Hessian of your true loss. Because it turns out that this KL,

that KL term, whose curvature is exactly this matrix, is related to the Hessian of your loss by this extra term, which is typically treated as being zero.

I think I lost half of you, but the whole story was really this: natural gradient is usually presented as a preconditioner, because it's not about using the Hessian of your loss but about using a different matrix; and yet it turns out that this matrix has a lot of relationship to the Hessian. So the whole point I'm trying to get to is that this outer product of gradients

is, for various reasons, a very good proxy for your curvature. And the reason this is interesting is that, if you squint at it, this is very close to the formula that Adam uses. So Adam takes the gradient squared. Now, if you assume that this matrix is diagonal, which it is not, then this would be just the gradient squared elementwise, and so the squared gradient is a proxy of that.

So this is just telling you that you can approximate the curvature of a function by looking at the square of the gradients. [Student: so Adam is a special case of this?] Yes, correct. That's what I'm trying to convince you of: they are connected, though there are some glitches on the way.

But yes, you can think of Adam that way, and this particular paper, if you want a much more formal take on it, the Bayesian learning rule paper from Emtiyaz Khan and others, is really trying to pin this down and argue that Adam is just a diagonal version of natural gradient.

That's the whole spiel of that paper. It's a bit more complicated than that, though, for a few reasons. Okay, let me go through why it's a bit more complicated. The first reason

it is more complicated is that in the Gauss-Newton approximation and in natural gradient, what you do when you're computing this matrix is you use the gradient from the output to your theta. So the Jacobian goes from y to theta, right? It's not from the loss to theta; it's from y to theta.

In Adam, you're using the gradients you're using to do your optimization: the gradients that start at the loss and go to theta. You don't use the gradients that start at y and go to theta. In the literature, this thing is called the empirical Fisher. So you have the Fisher, which is what Amari introduced, and the empirical Fisher, which starts from the loss instead of starting from y.

And the reason why the literature around natural gradient, if you ever find this topic interesting and decide to look at the papers, is a bit messy, is because people confuse these matrices all the time. You have papers that mixed up the empirical Fisher with the proper Fisher and replaced one with the other.

They're not the same object, they're different mathematical objects, but people use them interchangeably. So a first step to go from natural gradient to Adam is to replace the true Fisher by the empirical Fisher. The other thing that makes Adam a bit different is that you take the square root.

And the square root thing is a little bit of a "no one really knows, but it works really well" kind of thing. There are different ways to argue for it. One way is: if you take a matrix and approximate it by its diagonal, so you remove all the off-diagonal elements and just keep the diagonal, then

usually, if you compare the eigenvalues of the diagonal with the eigenvalues of the full matrix, you're overestimating the eigenvalues, and taking the square root pushes them back down, to correct that overestimation. This is a very hand-wavy argument, but it's something people use.

Another argument that maybe some people will like is: if you look at the formula for Adam and you assign units to the different quantities in there, it turns out that if you don't have the square root, the units don't work out. I don't know if you guys know, but in physics, people do this all the time.

If you want to figure out whether you have the right formula for something, you plug in the units and check that they come out the right way, right? If the units don't work out, you've messed up the formula; it's not the right one. So it turns out that if you work through what the units of the update should be, you need the square root to make the units work.

That's another argument I've seen for that square root, but really the answer is that the square root just helps quite a bit in practice. And okay, let me give you another, better reason why you need the square root. Another reason is that you don't use the true Fisher; you use the empirical Fisher.

So what is the difference between these things? The difference shows up at convergence. When the loss is zero, this extra term becomes zero, and the true Fisher, which is this function, becomes your Hessian. It's the same as the Hessian, so that's fine.

But if you look at the empirical Fisher, the empirical Fisher is the outer product of the gradients of the loss with respect to theta, and at convergence the gradient of the loss with respect to theta is zero. So if you square this thing, you get zero. So if you look at that formula, where you have the gradient divided by the gradient squared,

as gradients go to zero, the denominator goes to zero even faster, because you're dividing by the gradient squared. So what that means is that Adam becomes unstable when you're close to convergence if you don't have the square root, because if you look at that limit, as the gradient goes to zero,

that thing explodes, goes to infinity, because you have g over g squared. And the square root solves this. Well, it makes it better, which is good enough, it seems, for convergence. Because the rate at which g over g squared diverges, in this case it goes to infinity, because it's like one over g as g goes to zero,

is not what you get with the square root. With g over the square root of g squared, that stays constant, right? Because g over the square root of g squared is roughly one, more or less, up to some variations. So this makes things a lot more stable.
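
You can see this in a couple of lines (a toy numeric sketch, ignoring the moving averages and the epsilon):

```python
import numpy as np

# As the gradient g shrinks toward zero near convergence:
for g in [1.0, 0.1, 0.01, 0.001]:
    no_sqrt = g / g**2               # without the square root: 1/g, blows up
    with_sqrt = g / np.sqrt(g**2)    # with the square root: |g|/|g| = 1, stays bounded
    print(f"g={g:g}  g/g^2={no_sqrt:g}  g/sqrt(g^2)={with_sqrt:g}")
```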

So these have been the kinds of arguments I've seen. Okay, I'm gonna push a little bit through this. I can appreciate that this part of the lecture is a bit heavy; optimization ends up being that way. I wanted to make another note that is kind of useful,

which is: we did all of this exercise because we were worried about negative eigenvalues. But one thing that turns out is that you still need to regularize the Fisher. The Fisher is this matrix that we've been computing, this outer product we've been playing with. And the reason you still need to regularize it with the identity

is because it's positive semi-definite: you don't have negative eigenvalues, but you can have eigenvalues that equal zero. And that is still a problem, because it means you can't compute the inverse. So in practice, you still regularize this matrix to get rid of the zero eigenvalues, but at least you know the only thing you need to do is correct for zeros.

So you can add a small epsilon just to make it invertible. [Student: can you just weight it by something small?] Yeah, anything positive? Well, you need to be careful, because there is the condition number, right, which is the largest eigenvalue divided by the smallest eigenvalue.

If the condition number is very large, so if you take the spectrum of eigenvalues and look at its extremes, the largest and the smallest, and the gap between them is huge, then usually, numerically, things are not very stable when you try to do the inversion directly.

So you don't just want to add 10 to the minus 12 and say, okay, I don't have any zeros, my smallest eigenvalue is 10 to the minus 12, because then numerically your inversion might still be unstable. You want to add something big enough that the gap between the largest and the smallest eigenvalue is reasonable, so that the inversion is stable.
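
A small sketch of that effect (NumPy, with a made-up rank-deficient matrix): a tiny epsilon makes the matrix technically invertible but leaves the condition number huge; a larger damping brings it down, at the cost of more bias.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
F = A @ A.T          # rank-3 PSD matrix: two eigenvalues are exactly zero

for eps in [1e-12, 1e-4, 1e-1]:
    F_damped = F + eps * np.eye(5)
    print(f"eps={eps:g}  condition number={np.linalg.cond(F_damped):.3g}")
# tiny eps -> astronomically large condition number, inversion still ill-behaved;
# larger eps -> smaller condition number, more stable inverse
```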

But this is more in the numerical-stability kind of space. It's not anymore about dealing with negative curvature; it's really about making your computation stable when you do the inversion. The other thing I wanted to mention, when it comes to practical methods that try to do second-order optimization, is that one thing they do is use a block structure, and the reason is that people have argued that the off-diagonal blocks, the blocks that correspond to interactions between layers, are small. So, block structure:

what does it mean? It means you compute the Hessian per layer. You don't look at how one layer affects a different layer; you just say, I'm going to compute the Hessian for each layer independently. And if I'm looking at my whole Hessian, it means I assume all of the off-diagonal blocks are zero,

and I only have non-zero elements in these blocks on the diagonal. This is just a trick to save memory and compute, and the justification is that people have argued that the off-diagonal blocks tend to be very small in norm compared to the blocks on the diagonal. So this is just another approximation.

And I'm gonna skip that slide because it's too much math. So this is the high-level picture. This is what natural gradient is about, and this is what second-order methods are about: computing these Hessians and dealing with negative curvature.

But then, when it comes to training models in standard scenarios, it turns out that keeping things cheap is crucial to making them work. So, maybe to repeat something I was saying in the break: all of these methods can be very useful when you have domains in which the loss surface is badly behaved. The example I gave was, say, you're trying to simulate the Navier-Stokes equations or something like that: you have some neural network that is trying to predict the evolution of this system.

Mathematically, if you look at the loss for trying to predict this, it's going to be something that looks very ugly, so that's a place where you want to use these powerful optimizers. But if what you want to do is learn to classify ImageNet,

or do language modeling, which maybe many of you are thinking about, then it turns out that this is not as important. It turns out that the loss actually behaves quite well, and usually the outcome is that keeping the updates cheap is the best way to go about it.

This is what Adam and momentum do. You can think of them, to some degree, as a very crude approximation of all of this business about curvature and whatnot. But they're extremely cheap, and because of that, they dominate; they work well in almost all standard scenarios. There is another hypothesis in the air that people have been talking about for a while, which is:

especially for modern architectures, we've been developing them to work for Adam. So there is this, I don't know what to call it, this theory that what we have done with Transformers and other architectures like ResNets is we basically tuned the architecture to the optimizer, and if we were using a much more powerful optimizer,

we might have found much more interesting architectures. For us, these things are intertwined, right? Whenever I propose a new architecture, I run it with Adam, and if it doesn't work well with Adam, I assume the new architecture is not a good idea, and I'm not even going to publish the paper. That itself is what

maybe drove things to this place where everyone is using Adam. There's this bias in the community, where in general researchers prefer to focus on the architecture, not on the optimizer. So usually they pick the default optimizer and run everything with it. Therefore, when they're developing architectures, they're finding the architectures that have well-behaved curvature, the ones where Adam actually works.

So all of the stuff before: it's heavy math, and it looks scary, and it's interesting. But in the end, in most cases, you're probably never gonna end up using it. You're gonna just end up using the simple thing, and the simple thing is something like RMSProp or Adam.

And here, I mean, maybe momentum should have come first, but here you basically have two terms, right? The first one is the momentum term, where instead of using your gradient, you're using a moving average of your gradient. That does multiple things.

One thing it does is reduce noise. It also shrinks your gradients in directions of high curvature, where the gradient changes direction all the time. But also, with the noise reduced, it becomes a better approximation of the true gradient.

So instead of overfitting on the current mini-batch, you have this moving average over all the previous mini-batches, so it's harder for you to be super biased by the current batch. And the other thing you're doing is computing this moving average of the squared gradients.

And one way of thinking about this is that the squared gradient is a measure of curvature, and this comes from all of this empirical Fisher, natural gradient business and so forth, where you can show that you can always write your Hessian as a sum of outer products of gradients, plus another term,

and that term usually disappears. By the way, I did not say this explicitly, but when you look at this formula, you see there is a Hessian here in the middle. That Hessian, for most losses, is a constant; that's why I ignore it. So this formula, the Gauss-Newton, is really just the outer product of gradients with that Hessian in the middle.

It's there mathematically, but if you write it down for common losses, it turns out to be a constant or something that doesn't matter. So that's the connection: the gradient squared is a proxy for your curvature. And then you take a step that is basically

the momentum divided by the square root of this, and this is what you're gonna use in practice. And you have this epsilon, which is the regularization to deal with zeros. We were just talking about how the Fisher has zero eigenvalues because it's only positive semi-definite; likewise,

this squared-gradient estimate can also have very small numbers, and you don't want to divide by zero. So this epsilon is a regularizer for that; it plays exactly the same role. Usually that epsilon is super small. I know you guys did a course on RL, so one hint that I have is that in RL it turns out to be very useful to make that epsilon very high.

The reason is not that well understood, but the difference between supervised learning and RL is that in RL, if you run Adam, it's actually quite useful to hypertune epsilon and make it larger than normal. But otherwise, this is what it is, and, as I said, the update is written out here.
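
For reference, a minimal sketch of that update (standard Adam with bias correction; the hyperparameter values are the common defaults):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum over gradients plus a moving average of squared gradients."""
    m = beta1 * m + (1 - beta1) * grad        # momentum term (moving average of g)
    v = beta2 * v + (1 - beta2) * grad**2     # curvature proxy (moving average of g^2)
    m_hat = m / (1 - beta1**t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # note the square root and the epsilon
    return theta, m, v
```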

This is the momentum part, and this is kind of the top part of Adam. And really, the intuition of momentum and how it connects to curvature is in this picture, right? If you have a direction of high curvature and a direction of low curvature, you're gonna get this zigzagging behavior.

Now, if you compute the momentum, you'll notice that in the direction where the gradients agree, you're gonna accelerate, and in the direction where the gradients disagree, when you do the moving average, you're gonna squish the magnitude. So you end up with the blue curve, which is a lot more aligned with the curvature. That is, very intuitively, how momentum works.

There are different ways of deriving it. Another way is to take inspiration from physics, and that's why it's also called momentum: if you have a ball and you let it go on a surface, you get a similar kind of effect
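
Here's a toy sketch of that zigzag picture on a made-up ill-conditioned quadratic: plain gradient descent oscillates along the high-curvature axis and crawls along the low-curvature one, while momentum damps the oscillation and accelerates.

```python
import numpy as np

# Quadratic loss 0.5 * (100*x^2 + y^2): high curvature in x, low in y.
curv = np.array([100.0, 1.0])
grad = lambda p: curv * p

def run(momentum, steps=100, lr=0.015):
    p, velocity = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        velocity = momentum * velocity + grad(p)
        p = p - lr * velocity
    return p

print(run(momentum=0.0))  # plain GD: still far from 0 along the low-curvature y direction
print(run(momentum=0.9))  # momentum: much closer to the optimum at (0, 0)
```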

there as well. The last bit that I kind of alluded to, and after this we're gonna take a short break and then go a little bit into the generalization part. So, another thing I've alluded to: so far we've been talking about gradient descent, which means we compute a gradient on the entire data set, for the entire loss.

In practice, that is not possible. So in practice, what we do is estimate the gradient using a mini-batch. And the way this works is we basically need to make the assumption that our data points are IID: they are distributed according to the same distribution, and they're sampled independently.

And if you make that assumption, we know that the expectation over the full distribution is equivalent to, or well approximated by, picking a few samples and averaging over them. And this is the difference: what people say you should expect is that the stochastic gradient looks right, moving in the right direction on average, just with more noise.
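
A minimal sketch of that unbiasedness claim (NumPy, made-up linear-regression data): averaging many mini-batch gradients recovers the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
theta = np.zeros(5)

def grad(idx):
    """Gradient of the mean squared error over the rows in idx."""
    err = X[idx] @ theta - y[idx]
    return X[idx].T @ err / len(idx)

full = grad(np.arange(1000))                              # full-batch gradient
batches = [grad(rng.choice(1000, size=32)) for _ in range(5000)]
print(np.abs(np.mean(batches, axis=0) - full).max())      # ~0: mini-batch gradients are unbiased
print(np.std(batches, axis=0))                            # but each individual one is noisy
```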

It's moving in the right direction, but it has a bit more noise. SGD was mostly introduced as a scheme to make things scalable. In the early days, everyone thought gradient descent was the correct thing to do, but we can't afford it, so let's do stochastic gradient as a proxy for it. In more recent work, and we're going to talk after the break about this,

it turns out that stochastic gradient actually works better than gradient descent. So it's not about computational efficiency anymore; it's the better thing to do, you get better results out of it. And this is just in terms of naming conventions: stochastic gradient means your batch size equals one,

batch gradient descent means you use the entire data set, and mini-batch SGD means you fix a mini-batch size, so you split your data set into groups of whatever, 10, 20, 256, or whatnot. And then, maybe going to this slide, the other very important thing is that, even with all of these measures of curvature and whatnot,

it still turns out that the best thing to do is to adapt the learning rate. The learning rate is not a constant; somehow, because of the approximations we make and all of that, these things will not tell you the right step size, so with a fixed learning rate you will not get the right results.

By far one of the most important things is to have learning rate decay, a learning rate schedule. This is a very old-school learning rate schedule; this is how you used to do it: you start with a constant learning rate, and either you have magic numbers at which you divide the learning rate by something, or you look at the training loss or validation loss, and if it doesn't decrease, then you divide the learning rate by a number.

That's how you used to do it; I used to call these waterfall learning rate schemes. Nowadays, you have a linear warm-up and then some decay, for example exponential, of your learning rate, like the sketch below. This is kind of the standard, and usually what you hypertune is the highest learning rate that you start from.
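
Something like this (a minimal sketch; the hyperparameter values are made up):

```python
def lr_schedule(step, peak_lr=3e-4, warmup_steps=1000, decay_rate=0.999):
    """Linear warm-up to peak_lr, then exponential decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # linear ramp from 0 to peak_lr
    return peak_lr * decay_rate ** (step - warmup_steps)   # exponential decay afterwards

# e.g. inspect it: [lr_schedule(t) for t in range(20000)]
```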

And these schemes tend to work very well, so this is just the standard. I don't have a lot of slides about why and how, but if anyone is interested, I'm happy to talk about that. And I'm gonna stop here for a break, midway, for like 10 minutes, and then we're gonna continue for the last hour to talk a little bit about generalization.

For a long time, this was the main perspective on learning in neural networks: they're non-convex, they have these very complicated loss surfaces, therefore it's impossible to train them to do anything useful, because anything you do, you're gonna get stuck in a local minimum.

And it got to the point, as I was telling you, that in the beginning of the 2000s there were maybe three or four groups that were still doing neural networks, and everyone else was doing SVMs and other things. With SVMs, you have convexity; you have guarantees that things are gonna work out.

So really, what was going on is that everyone was worried about these kinds of issues, and then you had a bunch of people who were like, well, you know, maybe it works. It's, you know, not great, not terrible; let's go with it. And the miracle of what happened is that these people were playing around with things,

and obviously they got some things right as well to make it work. But then they started seeing consistent behavior. You actually have a paper, it's not cited here, I think with Yoshua Bengio, I can dig it up if anyone wants to see it, where they look at this systematically, right?

They sample many random seeds, many starting points, and show that the behavior of the neural network is consistent. And all of this kind of stuff matters because, back in the day, this was a big theme, right? This was a big reason why people did not like neural networks, because they believed

it's all about luck, and there's nothing systematic going on there. So this is where the problem of generalization starts, right? You have this loss; sure, you can optimize it and get somewhere that is better than where you started, but is it anywhere meaningful, and how do you consistently get to somewhere

that is really good?

And where we've landed now, the current myth if you want, is: if you have a neural network and you have some data, you throw the data at the neural network with Adam, and things are going to work out, and you don't need to worry about anything. And if it doesn't work, just make the neural network bigger, and it's going to work.

And that's it. Another way of framing it is that people now view the optimization problem in neural networks as being almost convex. Maybe it's technically not convex, but it behaves as if it were convex. And this is on the back of a bunch of works; these are different plots here from different works

that I'm gonna try to explain. So, historically, the way things happened is that after some changes, including maybe switching to ReLU, some changes to initialization, and some changes to SGD that kind of happened in the background, and then some stuff with the greedy layer-wise pre-training that Geoff and Yoshua were doing, things

started to work consistently. And there were more and more results, you know, like the ImageNet results and so forth, that people could reproduce and get these good numbers with. And then you started getting a series of papers that were trying to check what's going on. These are generally empirical papers.

For example, this paper here, the paper from Ian Goodfellow, Oriol Vinyals, and Andrew Saxe. What they do is they take the starting point, your θ₀, they take the point where you end up at convergence, and they interpolate on that line. So this is a line in parameter space.

They interpolate very finely, and they compute the loss at every point on that line. And they show that the loss is monotonically decreasing. The point is: if I walk on a straight line from where I started to where I converged, there is no wall, there is no weird shape in my loss surface; it's just something that's monotonically decreasing, like in a convex case.
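
The experiment itself is only a few lines; here's a sketch, where `theta0`, `theta_star`, and `loss_fn` are placeholders for your flattened initial parameters, converged parameters, and loss evaluation:

```python
import numpy as np

def interpolation_curve(theta0, theta_star, loss_fn, n_points=50):
    """Loss along the straight line from the initialization to the converged solution."""
    alphas = np.linspace(0.0, 1.0, n_points)
    return alphas, [loss_fn((1 - a) * theta0 + a * theta_star) for a in alphas]

# The paper's observation: for trained networks, this curve tends to decrease
# monotonically along the line, with no bumps or walls.
```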

It's just something that's monotonically decreasing, like in a convex case. Done this on a bunch of examples, and they showed this. And you know? From now, this hypothesis that look things looks almost comebacks. If you want to go on, the more methy heavy side. This is the much methy heavy side, and this is actually a pretty cool paper.

I really like this paper by Yann Dauphin; I'm an author, so I'm biased. There's another paper, from Yann LeCun's group, that says roughly the same thing, the Choromanska et al. paper, and that's the one people usually cite rather than my paper here. But it is a different take. So this comes from

statistical physics. They come up with this theory by looking at, what are those, random Gaussian fields, which I don't know deeply, but there is something called a random Gaussian field that's been studied in statistical physics, and it has this very interesting property: as you expand the dimensionality of the space in which the random Gaussian field lives,

the error that you can reach and the index become strongly correlated. What does that mean, because this sounds a bit heavy. What it means is: the lower the error, the lower the index, and the index is the number of negative eigenvalues. So the lower the error, the fewer negative eigenvalues you have.

So basically, if this correlation is very strong, as is shown in this picture, as the dimension of the space grows, then all the fixed points that have zero negative eigenvalues also have very low error. So let me rephrase this, and this is kind of the theme of where we are now in thinking about the problem.

What that means is that as you blow up the size of the model, all your local minima will basically have the same error, so they're all going to be very similar to your global minimum. What you're gonna get instead is an exponential number of saddles, and that's why we talked about saddles before.

So the intuition right now is that the only thing you need to worry about is saddle points. You don't need to worry about local minima. [Student: yeah, I don't think I get the intuition. Why do all my local minima become similar, why does the gap between them and the global minimum become small?

Why do they all kind of go to the bottom?] Yeah, so,

okay. So the way this paper works is really by taking these results from physics, where there are proofs of this for that particular kind of object, and then trying to connect these random Gaussian fields to simpler random models, and those models to neural networks. That way, they say, oh, these have to behave the same.

And then they empirically compute this kind of plot that shows the same kind of strong correlation. The intuition of why this should happen is a little bit hard to state. The usual framing, and this is something Yann LeCun used to like to say, is that if you're in a high-dimensional space, the probability that all directions point up is very low, so as you increase the dimension, you're always going to find a way of escaping anything that is bad.

That's the way he frames it, but really, I think what is happening is

something weird happening with distances, where things become closer to each other, or more equally distant from each other, and somehow that helps you navigate the space a bit better. It is not clear to me what is going on either, and that's why I call this a myth, because we don't necessarily have any kind of

grounding for why this has to happen. All I can say is that it has been observed a lot, and people started connecting the dots and saying: really, it seems to hold in the standard training regime. And this is another observation:

if you stop during training and you evaluate your Hessian and check whether it has negative eigenvalues, you'll find that it always has negative eigenvalues. And the thing is, if it has negative eigenvalues, it means there is a way to escape, to go lower, because it means that, at best, you're at a saddle.

These are just things that people have observed in practice, and this is the background we're in right now. That doesn't mean the loss surface is always well behaved, though, and here are just a few examples of how things can go wrong, which I'm going to try to explain.

So in this first paper, which is more of a joke paper, but it's pretty funny, what it's trying to say is that if you give me an image, say this Christmas tree image, I can find a place on the loss surface where the loss looks exactly like that image. So there is a subspace of the loss that has that shape.

The point of this paper is to say that, if you want, in some subspace, the loss surface is arbitrarily ugly; the only catch is that it's arbitrarily ugly far away from zero, far from where you normally operate. So, I mean, there are some layers to this. Maybe let me put it this way:

deep learning, the way it exists, and the reason it exists, is that everyone who's doing these experiments follows the pipeline. There is a protocol of how you train these models, right? There is a scheme to initialize them, there is a way to do the gradient descent, and so forth.

So the point here is that if you go far away from the standard initialization, you get into trouble. And one way to get into trouble is this kind of funky theoretical work that says, look, you can find any shape you want in the surface. Or, maybe even more trivial:

you can define your training set in such a way that all your ReLUs, if this is a ReLU model, which is what this paper does, are dead: by playing with how you pick your data points, you can find a data set such that, for one layer, all the ReLUs output zero, and then there's no learning happening.

And that's a bad local minimum, sitting right at your random initialization.

There's also this result here, which is that there is a way of initializing the model to make it have zero training error but be at chance on your validation set. So you can completely break the ability of the model to generalize just by playing with the initialization. But the core idea, the intuition I'm trying to convey, is that things look almost convex empirically, things look well behaved, and there are theoretical reasons, a bit hand-wavy in the sense that the very strong mathematical results are in statistical physics and the connection

to neural networks, which people have been leaning on, is looser. But all of these things look good as long as you do things properly. And by properly, I mean you have proper initialization, you use a standard optimizer, you have normalized your data, and your data,

you know, has nothing pathological in it. So as long as you're following the protocol and doing normal learning, things look good; as soon as you move far away, things look bad. Then there is this funky paper, just for people who like this kind of thing. This is from

Mikhail Belkin, who also did double descent. It's something that, to me, is kind of surprising; in the end it doesn't mean that much, but it's surprising. So the trick is: in a ReLU network, I can multiply one layer by alpha and the other layer by one over alpha, and I have the same function.

And that's just because, if alpha is positive, it's not going to change the sign, so the ReLU doesn't care. So really I have W1 times alpha, times one over alpha, times W2, and that becomes just W1 times W2.
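
A minimal numerical check of that invariance (NumPy, made-up weights):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)
relu = lambda h: np.maximum(h, 0.0)

alpha = 3.7  # any positive scalar
original = W2 @ relu(W1 @ x)
rescaled = (W2 / alpha) @ relu((alpha * W1) @ x)  # scale one layer up, the next down
print(np.allclose(original, rescaled))  # True: same function, different theta
```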

This means that the set of minima is always curved, because this one-over-alpha relationship traces a curve. So if I have a global minimum, the global minimum is never going to be a single point. It's going to be a region, and that region is always curved, because the region has to follow this one over alpha: if I have a point that is a global minimum or a local minimum,

I can construct another minimum by just multiplying one layer by alpha and the layer below by one over alpha, so I can trace out this curve. What that means, and this is the punch line of the paper, is that you can take a minimum and zoom in, and no matter how much you zoom in, it is always going to be curved, and it's always going to have negative curvature.

The standard assumption people make is that things are locally convex, and therefore I can use gradient descent and all this analysis; this shows that's never literally true. So basically, all of these slides, the details don't matter, but what they're trying to say is that the loss surfaces of neural networks are quite complex as mathematical objects,

and there are papers trying to highlight this. But at the same time, as long as we use them following the standard recipe, they are extremely well behaved, to the point where people don't worry about them anymore. So we'll now try to go into some more reasoning about why things are well behaved, and into the usual reasoning about where this generalization power comes from, not necessarily connected to those early works, but something that's a bit more hands-on and makes a bit more sense.

And to do that, I'm going to start from the standard point where people start when they talk about this. Usually the way you would introduce this is: you look at a plot like this, and you say, okay, I have some data points, and I'm trying to fit something to them.

I try to fit a line, it looks something like this. I try to fit a quadratic, it maybe looks something like this. And then I try to fit a ninth-order polynomial, and it looks something like this. And if I'm just looking at my training error, my approximation error, the ninth-order polynomial wins.

But if I'm looking at these plots, which one seems more reasonable for the data that we have? Probably the quadratic, right? So this is the classic overfitting/underfitting picture. There is a sense in which just driving training error down is not good, right?

There is a point where you're not doing the right thing anymore. And what we care about in neural networks, and the way we find the right time to stop, so as not to end up in the ninth-order fit, is usually the usual train/test split.

So here, what I'm trying to say is that what we really care about, at the moment, is being able to generalize in-domain, more or less. Because we're talking about in-domain, the way we ensure we're not losing that property is to rely on statistical learning theory, which basically means we assume there is some distribution π and we've sampled data from that distribution in an IID fashion.

So every time we want to compute this integral, the loss that we really care about, the only thing we need is some unbiased samples so that we can estimate the loss outside the training data. And this is how things go in practice:

you have a bunch of samples, call them the training set, these are the ones we're allowed to train on, and then we have some held-out examples, call them the validation set. And if the validation loss

starts increasing, it means things are bad. And this underpins basically all of machine learning. I mean, it's not a deep thing, but it's the standard thing, right? Anything you do, when you're dealing with data, you always have a training set, a validation set, maybe a test set, you know.

You can do other things, but this is the standard recipe: you always train on the training set and use the validation set to see where you are. This is not the only choice, as I said. And this choice on its own is problematic; in particular, it's becoming problematic

for LLMs, where people now have a really hard time dealing with it, because what is really hard to know there is whether your test set is included in your training set. It's a real problem: with the size of the data that we have, the way the data is being collected, and our limited ability to understand the data,

this concept of IID sampling, you know, sampling independently from the same distribution, kind of breaks down. And actually, I think it's one of the big problems the community has in this space, and I'm not sure how many people are actually thinking about it, but it's really hurting, to the point where

it is becoming harder and harder to know whether a model is actually better than the previous one. You can get better metrics on different dimensions, but somehow the model is still not better, because you're not measuring the right thing. And this comes from this

distributional assumption that is hard to maintain. So there's another choice, and I think I had a slide for it, I'm not gonna go too deep into it, but there's another choice that I think sounds kind of interesting. I think it has problems as well, but this choice is given by the minimum description length principle,

Solomonoff induction, and other funky things like that. Basically, all it's saying is that the model that compresses the data best is the model that will generalize. That's the principle. The idea there is, you look at how many bits you need to store the model

and how many bits you need to store the data given the model. So you look at these two terms: this is how many bits the model takes, and this is how many bits the data takes given that model. The thing that minimizes this total is the thing that is going to generalize better. And the nice thing about this, if you take the prequential approach,

is that you don't need a distribution. So maybe let me explain how this works, looking just at the second term. You take a single data point and you fit the best model you can to that data point. Then you take two data points and you fit the best model you can to the two data points, and you keep going: for every prefix of the data, you always fit the best model you can. This will generate a curve, which is almost like a training curve, but not really, because for every subset of the data you fit the best model you can.

And then you compute the area under this curve.
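
A minimal sketch of that prequential computation, where `fit_best_model` and `log_loss` are placeholders for your training routine and per-point code length:

```python
def prequential_code_length(data, fit_best_model, log_loss):
    """Area under the 'best model so far' curve: the cost of encoding each point
    given a model fit only on the points that came before it."""
    total_bits = 0.0
    for t in range(1, len(data)):
        model = fit_best_model(data[:t])          # best model on the first t points
        total_bits += log_loss(model, data[t])    # bits to encode the next point
    return total_bits  # lower total => better compression => expected to generalize better
```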

The thing that has the lowest area under the curve is the thing that's going to generalize better. There is another intuitive way this has been framed, which is a lot more vague, but maybe useful, and I think it's in the paper from Yoshua's group, which says: basically, the thing that learns the fastest

is the one that's going to generalize better, because it has the right inductive biases. So basically, the concept here is that you don't look just at the loss you get on the training set at the end, but at how quickly you get there.

And if you get there faster, it means that somehow you have the right structure, and you're exploiting the right structure to get there; therefore you're gonna generalize, since you move faster. [Student: is that why a CNN may be more interesting and better, in this view, than a vision Transformer, for example, because it has a better prior,

before some scaling?] Yeah, so in some sense, yes and no; to be fair, this is a bit hand-wavy. In some sense, yes: CNNs do have the right inductive biases when it comes to images, because they assume this translational invariance that the Transformers do not, and therefore you expect the Transformers to need

a lot more data to learn the true structure. It's almost, if you want, and we talked a bit about strong generalization, that CNNs bake in something about images that you know has to be true, which is that locality is important when you process images, and Transformers have to discover this by themselves.

So, in that sense,

convnets will generalize out of domain in a more meaningful way when the task stresses this concept of locality somehow, and if Transformers don't learn this exactly right from the data, they will struggle a bit more. Now,

empirically, I know everyone is switching to ViT, and those tend to work better. I'm not really a vision person, so I don't know exactly the reasoning behind it, but I assume it's a mix of everyone wanting to use Transformers because they're very popular and you have the right libraries,

and also a question of scaling, because certain things scale better with Transformers. And there is one failure mode of convnets that Transformers handle

quite well, which is: if you have two objects that are related to each other but are not close to each other in the image, the only way a convnet can make the connection between them is by going very high in the convolutional hierarchy. When I properly talk about convolutions, I'm gonna show you this.

But basically, the depth controls how much of the image you're actually seeing, the receptive field. So if you want to connect two dots that are far apart, like here, you need to go pretty high in the hierarchy. But the problem is, as you go higher in

the hierarchy, you're throwing away high-frequency content, so you're only looking at a smoothed version of the image, and that makes it really hard to reason. So if you have occlusions, for example, convnets have a really hard time dealing with them, because they struggle to connect objects that are not contiguous, especially when there is something in between them.

Transformers don't have this issue: in a single layer they can jump around the image, because they don't have this locality constraint

that is otherwise useful. You could also argue that locality is not exactly the ground-truth inductive bias either, because you have these occlusion cases where it works against you. But in spirit, that's kind of what this prequential stuff is trying to say. I think you had a question as well, but I'm not sure.

"Yeah, I'm just trying to link the MDL principle, because that's more about efficiency, right, with the idea of generalization. Is the claim that the model is better if it's more efficient?"

So, the MDL principle is an information-theoretic kind of principle: it's something that looks at the data. It's not really about which model converges faster. The true definition of it is exactly the one that I gave, where, independently, for each subset of the data,

you train the best model you can on that data. Okay, so the way this has been done in practice, like this experiment at the bottom, which is from Jörg, who is one person that likes this a lot, is really by looking at the area under the training curve, because you can't afford to train a ResNet a million times,

once for every new data point that you have. So when you "retrain" for a new data point, you just do a single SGD step starting from the previous solution, and that's it. So in that sense (and I'm not sure this even answers your question), the way this principle has been used in practice has been bastardized into "which one trains faster". But really, the principle

is asking: what is the number of data points you need to get low training error? If you can get away with fewer data points, that means you have a better model. That's conceptually what's going on, but it has been reduced to the practical version because that's the bit you can actually compute.

So let me just check what the next slide is. "Oh sorry, yeah, I'm a bit lost between the two concepts. I understand the second one, which we've been discussing now, about the information; how does it relate to faster convergence?" So, there is no theoretical link

as such. This original MDL concept (if I'm not mistaken, the first paper might be from the 60s, but it was developed a lot in the 90s and early 2000s; Marcus Hutter is a big name that liked to work on this quite a bit)

has been translated into actual day-to-day deep learning by converting this exhaustive refitting into something cheaper.

The reason for that is that they argue a proxy for retraining on the new data is to just take the previous solution and take a single SGD step on the new data. But there are also information-based metrics; for example, say I want to evaluate a language model on some text.

Yeah, I can use the perplexity per token, or per character, or whatever. And ideally, if that perplexity is very low, I can say this model generalizes well on this corpus, or language. I mean, the perplexity is essentially the same as the loss.
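Just to pin down that equivalence (this is the standard definition, not something from the slide): for tokens $x_1, \dots, x_N$,

$$
\text{PPL} \;=\; \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\Big),
$$

so perplexity is just the exponential of the average per-token negative log-likelihood, the training loss on a different scale.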

It's really the same likelihood object that we have right here, just on a different scale. So what this is telling you is that you shouldn't just look at perplexity (that would be the usual way of doing things, with a train/validation split), but rather look at

which model needs the least amount of data to get the perplexity down. That's the concept this is going for, as a measure of generalization. And just to connect it to day-to-day practice, the reason I thought it's useful to present this

(and it's a bit of a niche thing; you're not gonna find it in traditional deep learning courses, and people are not going to talk about prequential evaluation) is that there is a reason I put it in the slides:

it doesn't require a distributional assumption, and it seems (I didn't talk about this experiment) that it can figure out spurious correlations. That remains to be seen; it's just one experiment and one claim from one paper. But what is interesting is that in large

language models, empirically, at places like Google and Facebook and whatnot, people are actually using this thing without knowing it. What's happening nowadays, because of the scale and the computational cost (and I can tell you this from my own experience), is that when they do the pre-training stage, they do not have an explicit validation set where they look at the perplexity. They

select models based on, in their mind, the training perplexity. But if you look at the code and what they're doing in practice (again, this is for compute-saving, engineering kinds of reasons), it's this: because you're doing backpropagation, you first do the forward pass and get your perplexity, and then you do your backward pass, compute the gradient, and apply the update.

So basically, what they do in practice is they first evaluate on the new data point and then take the step on that data point, which in some sense is very close to what this is saying. And the other thing they do, because a single point is noisy, is a moving average, which roughly corresponds to looking at the area under the curve.
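Here is a minimal runnable sketch of that pattern on a toy streaming linear-regression problem (the model, data, and constants are all mine for illustration; real pre-training loops are far more involved but follow the same evaluate-then-update order):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                             # toy "model": linear-regression weights
ema_loss, EMA_DECAY, LR = None, 0.99, 0.05  # illustrative constants

for step in range(1000):  # a stream of fresh batches, never revisited
    X = rng.normal(size=(32, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=32)

    # Forward pass: loss on data the model has NOT been updated on yet,
    # so this is an unbiased generalization signal (the prequential idea).
    err = X @ w - y
    loss = float(np.mean(err**2))

    # Moving average of that loss ~ integrating the area under the curve.
    ema_loss = loss if ema_loss is None else EMA_DECAY * ema_loss + (1 - EMA_DECAY) * loss

    # Only now take the gradient step on the same batch.
    w -= LR * (2 / len(y)) * X.T @ err
```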

So this is usually the metric that big groups end up using in practice when they have these big models that take most of the compute they have, and also most of the data they have. It is

not exactly done correctly, but it's close enough in spirit. This is sort of a claim that Jörg and Marcus Hutter, people I know and work with at DeepMind, would make. Obviously, they have a stake in the fight, because prequential evaluation is the thing they came up with, but they're trying to argue that people are actually using it day-to-day.

To expand on the point: in large language models, when they have to train these big models, there are usually two stages in LLMs, the pre-training stage and then the post-training stage. Again, I don't like these names, but anyway. The pre-training stage is really:

you have lots of data, you train the model from scratch, and the only thing you're looking at is perplexity. Post-training is usually when you start doing safety training, trying to remove harmful language; you do RLHF, you do instruction tuning to make the system follow instructions, and all of that.

These are all fancy names; what post-training really is, is a bunch of stages of fine-tuning on very dedicated data sets, or with very dedicated objectives. And it's in post-training that they have evaluations that are different from perplexity, right? They would use different kinds of ways of rating the system.

They have Q&A kinds of things where they look at how well it answers questions, they have human evaluations, and all of that stuff usually happens in post-training. But pre-training is the bit that takes most of the time; something like 99.9 percent of the compute and energy goes into pre-training.

And it's really just perplexity-driven: you have lots and lots of data, and the model just predicts the next token.

The compute that you need is so big that people usually do not do the standard thing. The standard way is: you run your training job, you ask your training code to save the weights regularly, and you have another process that you spin up on the cluster that loads up that checkpoint and evaluates it on some held-out set, and so you get some sense of how your validation error evolves.

But what people have started doing recently, I mean maybe in the last three or four years, is they don't run this validation job anymore, because it's too expensive anyway, and everything is expensive. What they do is actually look at the training error. That's sort of their claim: it's sufficient for me to look at the training error, and that will tell me when to stop,

whether my model is doing well, and whether everything is fine. I can use the training error instead of the validation error.

Now, that wouldn't be valid for the usual training error, but there are two things they do differently that make it not the usual training error. One is the way they compute it: they exploit the fact that if you want to compute a gradient, you first have to go forward to get a loss so that you can backpropagate, because that's how backprop works.

Therefore, they first evaluate on the data point and then take the step. So the loss is not measured on data you've already trained on; it's on the data you're about to train on. You compute the loss before you train on it, so in that sense you're not biased, right?

You first use the data to evaluate, and then you use it to train on. The second thing is the moving average, and the reason for that is that otherwise things are too noisy. But when you do the moving average, it's almost like you're integrating the area under the curve.

So it's not exactly the same thing, but it's connected to it. "Yes, so why isn't everyone using this idea of evaluating on the data point, doing the gradient step on it, and then updating my weights? Why is everyone"

"not using it, even when I'm training a very small CNN and I'm losing like 20 percent of my data for validation?" Yeah, I mean, okay, so there's no good theoretical grounding for this. If you talk with the people that work in this space, they would say it works because we never have enough compute to overfit the LLM anyway, so it's fine,

and that's why we can do it. But on small models, maybe you sometimes even do it without realizing; the point is that for smaller models, people just look at the validation error. They don't really look at the training error (I mean, you look at the training error for pathological behaviors), but mostly you rely on the validation error.

It's easy to compute; most of the scripts you find online will automatically compute the validation error, so it's just a matter of practice. I think what Jörg and Marcus Hutter would like is for everyone to do prequential evaluation and for no one to use the validation set.

But the truth is, the principle itself,

first of all, really says you need to train the best model you can for every subset of the data, which is not what we're doing, and there's no strong theory to say that the proxy we're using is good enough. So anyway, this is also new stuff.

It's one of those things where maybe at some point in the future we'll end up doing this. I kind of doubt it, because I think there is a big gap between the MDL principle and the sort of trick that we're doing in practice, and that gap might break things quite a bit.

Still, it's kind of interesting to think about it this way. There is also one more thing I should mention, because I find it a bit interesting. This also sounds a little bit like black magic: if you listen to Marcus Hutter, he really makes it sound like this is the answer to all problems. You don't need a distribution,

you don't need anything. But there is a catch in MDL as well, and the catch (this is according to me) is that the area under the curve that you are measuring is only meaningful after you've seen sufficient data, and what "sufficient data" means,

no one knows. So say I have a process generating my data that behaves like a linear function for a million steps and then switches to a different function. Until you get to the switching point, a linear model looks like the best way

to model your data, and then when you evaluate it as you go further, you'll see that it doesn't work anymore, because at some point the process changes. So really, there is this concept that, sure, the thing that compresses the most is the closest to the true program, because that's the most compressed form of it, but only if you've seen enough of the data generated by that process that it somehow covers the entire program. And this notion of "enough" is not

well defined in the MDL literature, right? In the MDL literature they say, well, you can do this trick, and if you do it, it's fine. But I think there is a gotcha there, where you can always reach the wrong decision if you somehow do not go far enough in how much data you see.

"But at the scale you said this is implemented in practice, that isn't really an issue." Yeah, for the LLM stuff that's not an issue; I'm just saying, if you think of this as the right way to do model selection, say for smaller things, it's not necessarily clear to people that there is a point below which you can't really trust the system, unless you've interacted with it enough.

And if you read his papers, he never talks about that. He only says: I don't need a distribution, this works, and it's sufficient to look at the area under the curve; that's the only thing you need. So, how much time do we have? A quarter of an hour, 15 minutes. So I'll try to push through a few more slides, because I was hoping to be further along in my deck.

But yeah, okay. So this whole prequential learning thing was a side note. If we go back to more traditional learning theory (and this is maybe something you already know), this is the picture you'll find in a textbook and the kind of thing people will tell you: you usually have these three regimes when you're playing with model capacity. One is underfitting:

you have a small model you're trying to train on the data, and in this scenario the validation loss, the test loss, and the train loss track each other and keep going down. Then there's the regime where the model is just right: it has the right capacity, and that's when the test and train losses are at their lowest.

And then, as you keep training, or as you make the model bigger, the training loss will keep going down, but the test loss is going to go up, and this is when you start fitting the noise. So this is exactly this picture, right? This is where you have a good fit, and if you keep increasing capacity you're going to overfit:

you're going to drive the training error even lower, but at the expense of the validation error. And how does this play out for neural networks? The way people think of it is that capacity is the size of the model, right? Of course you have the linear model versus maybe a quadratic model, but once you move to the neural-network space,

all neural networks are, you know, universal approximators, but their expressivity, their capacity, is determined by the size of the weights: how many weights you have, how many neurons you have. So the usual intuition is also that you need to pick the right-sized model for your problem,

one that's not too big. "Does this relate to double descent?" Yes, this is exactly related to double descent. This is sort of what an old-school introduction to generalization will tell you, but in practice we know this is incorrect: we know that the bigger the model, the better,

and if you keep making it bigger, it's even better. This is where double descent comes in. Oh, but I had a different slide before that. So before I talk about double descent, I just want to say that you can control capacity through the model size, but you can also control it through the number of training steps.

The way to understand that intuitively is: even if you have a neural network that's infinitely wide, a universal approximator, as expressive as you want, if you limit the number of SGD steps that you take, then you limit the set of functions you can reach, right?

Because you can't traverse the whole parameter space, you can only travel so far, so that's another way of limiting capacity. Limiting the number of steps plays the same role as limiting the number of parameters: you can make the model very big and limit the number of steps, or allow as many steps as you like and limit the size.

Yeah, I'm kind of swapping between these views, and maybe that's a bit confusing. Another way of controlling capacity is through regularization (sorry, I'm gonna get to double descent after this). Regularization is really this: if you have a loss that has multiple local minima, or multiple minima in general,

and you have a regularizer that has only one minimum, say an L2 penalty, which has its minimum at zero, then when you sum these two surfaces together, you make the different minima take different values, because the regularizer will favor the minimum that's closer to zero and assign a higher value to the ones farther away.

So that's one way of thinking about how the regularizer resolves the ambiguity: by prioritizing solutions that are closer to zero. I'm gonna skip over this slide because it's a lot of math. It's just really showing, and we talked about this before,

that if you're being properly Bayesian, you basically get a regularization term, which is your prior. I think we talked about this: usually, if you're being properly Bayesian, when you try to optimize things you get that your objective is the negative log-likelihood, which is how well you fit the data,

plus how well you respect your prior. And if you pick your prior to be a Gaussian centered at zero, that ends up being an L2 penalty, which basically says: I prefer solutions that have small norms. So this is just a probabilistic way of getting to the regularization term, and it's a natural outcome of having a prior.
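As a quick sketch of that derivation (standard MAP reasoning; the Gaussian prior variance $\sigma^2$ is the assumption):

$$
\hat\theta \;=\; \arg\min_\theta \; \underbrace{-\log p(D \mid \theta)}_{\text{data fit}} \;+\; \underbrace{\frac{1}{2\sigma^2}\,\lVert\theta\rVert^2}_{\text{L2 term from } p(\theta)=\mathcal{N}(0,\,\sigma^2 I)} \;+\; \text{const},
$$

so the L2 coefficient is just the inverse (scaled) variance of the prior: a tighter prior around zero means stronger regularization.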

Yeah, maybe there's not that much more to say here. I'll come back to this when I actually talk about convnets, but just to tell you, we have many techniques to regularize, and all of them constrain the problem. So here, with data augmentation, the model now has to learn that both these shifted versions of the image are a cat, so it has to learn more.

So you use more of the capacity. Okay, there are quite a few slides before the double descent one. Do I have time to go through that many slides, or should I jump to double descent? Let me just try to go through all of them; double descent should be right after this.

So, we're on the regularization point, and there is another observation. These are the more traditional ways, or at least, sorry, this one is a super traditional way of regularizing models; this is the textbook, I don't know, year-2000 way of regularizing a model. There are other, funkier ways of regularizing a model.

One idea that was thrown out by Hochreiter and Schmidhuber back in 1997 is this idea of flat versus sharp minima. There is this effect that people found, and it was very surprising: for a long time we were doing mini-batch SGD because we couldn't afford to do anything else, and at some point GPUs caught up

and became extremely powerful, so people were like, let's just do full-batch GD, because now we can, and that's the "right" thing to do. And then, when you do full GD, it turns out that things are actually worse. There was this big debate about what's going on, like, why is SGD helpful?

And one of the hypotheses that has kind of stuck around is that the noise is helpful because it allows you to escape sharp minima. The intuition here is that you have a loss like this, where you have a minimum that's very narrow and then a flat minimum; because of the noise in

SGD, you cannot converge in the narrow one. The noise will push you out, so the only minima you can converge to are the ones that are wide enough, much wider than the variance of your noise.

There is also a more MDL-like principle for why this is useful: the argument is that flat minima are much more compressible, like you need fewer bits to describe them. But another way to think about it (I don't have a slide on this) is: if this is my training loss, you can imagine that the test loss is basically a noisy version of it, shifted a little bit and so on. And the thing is, for the test loss,

if you shift a little bit at the narrow minimum, your loss is already going to be very high; but at the flat one, if you add noise, the loss stays roughly the same. So that's another intuitive way of understanding why flat minima generalize, and the thing that finds them

here is stochastic gradient descent.
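That shift intuition can be made slightly more formal with a second-order expansion (a standard argument, in my notation rather than the slide's): for a small perturbation $\delta \sim \mathcal{N}(0, \sigma^2 I)$ around a minimum $\theta^*$,

$$
\mathbb{E}_{\delta}\big[L(\theta^* + \delta)\big] \;\approx\; L(\theta^*) + \frac{\sigma^2}{2}\,\mathrm{tr}\,H(\theta^*),
$$

where $H$ is the Hessian of the loss. A sharp minimum has a large Hessian trace, so any mismatch between the train and test surfaces costs a lot; a flat minimum has a small trace and is robust to the shift.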

"Just a quick question, sorry, if you have time. Why do most people not exploit this? They focus on regularization, on learning-rate schedules, or whatever, but we don't see something like a batch-size schedule, where for example I start with a certain batch size to shape the noise."

"Exactly, yeah: why, in general, don't people mess around with the batch size during training, instead of just selecting a fixed value like 64?" Yeah, so I don't necessarily have a good answer. My expectation is that the engineering work you'd have to do doesn't buy you enough; it's basically not worth it for the improvement you'd see. Because, I mean, depending on your setup,

if you're playing with CIFAR or whatnot, there's probably no engineering work, but if you're playing with a large data set, you'd have to serve it in different ways throughout training, change how you're sampling, and things like that. There is another reason, maybe correlated to that, which is that if you change the batch size,

you need to change the learning rate. There is a formula going around for what the correction to the learning rate should be if you change the batch size, but that formula is not ground truth. So you can't really stop midway and retune your learning rate just because you changed your batch size.
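For reference, the formula being alluded to is most likely the linear scaling heuristic (popularized around Goyal et al., 2017, and known to break down for large batches, which matches the "not ground truth" caveat; a square-root variant is also in circulation):

```python
def rescale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    # Linear scaling rule of thumb: keep lr / batch_size roughly constant.
    # E.g., rescale_lr(0.1, 256, 1024) -> 0.4. Purely a heuristic, not exact.
    return base_lr * new_batch / base_batch
```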

And then, if you're doing Adam or things like that, there's more: I actually haven't even mentioned this, but in Adam you have at least two extra hyperparameters, beta1 and beta2. These are the decay rates of the moving average of your momentum and of the moving average of your squared gradients, and those might depend on the batch size as well, so maybe you'd need to retune them too.

So there are all these side questions that people don't want to deal with, and it's always just easier to have a fixed batch size and tune around it. There is, though, a whole line of work

called sharpness-aware minimization (SAM), which is a take on SGD whose main goal is to find flat minima. So there are algorithms (and SAM was used quite a lot at some point), including variations of Adam and so on, that are explicitly framed around this and trying to exploit it.
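A minimal runnable sketch of the SAM update on a toy quadratic loss (following the two-step scheme of Foret et al.; the loss and constants are mine for illustration):

```python
import numpy as np

def loss_grad(theta):
    # Gradient of the toy loss ||theta||^2; stands in for a minibatch gradient.
    return 2.0 * theta

theta = np.array([2.0, -1.0])
LR, RHO = 0.1, 0.05  # step size and perturbation radius (illustrative values)

for _ in range(100):
    g = loss_grad(theta)
    # Step 1: move to the (approximate) worst point within radius rho.
    eps = RHO * g / (np.linalg.norm(g) + 1e-12)
    # Step 2: descend using the gradient taken at that perturbed point,
    # which penalizes sharp minima, where the loss rises quickly nearby.
    theta -= LR * loss_grad(theta + eps)
```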

How much time do we have? Five minutes. I'm not gonna get to double descent, so I'll leave that for tomorrow; I'm just gonna continue with this, because it's actually a cool thing. This is by far

the most intriguing hypothesis; it seems to hold in a lot of settings, and it's one of the things that helps explain a lot of what is going on. So I think if there is any take-home message here, it's that understanding the sharp-versus-flat-minima behavior

is kind of important, because it underlies a lot of things. There is another framing of this, and I'm not going to go into the math of it, but roughly, they did this thing where they look at the discrete updates that you're doing, and they work backwards from that

to find a loss such that, if you follow the gradient flow of that loss, you get exactly the behavior of your updates. Okay, the details don't matter. The whole point is that they managed to identify an alternative loss that

your normal updates are effectively minimizing, and that alternative loss is your original loss plus a regularization term on the norm of the gradients. And this regularization term on the norm of the gradients is basically trying to do the same thing: it's saying that the minima you find have to be flat. So this idea comes up in many places.
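If I recall this line of work correctly (the backward-error-analysis result of Barrett and Dherin, extended to SGD by Smith et al.; treat the exact constants as approximate), the modified losses look like this, with step size $h$ and minibatch losses $L_k$:

$$
\tilde L_{\mathrm{GD}}(\theta) \;=\; L(\theta) + \frac{h}{4}\,\lVert \nabla L(\theta) \rVert^2,
\qquad
\tilde L_{\mathrm{SGD}}(\theta) \;\approx\; L(\theta) + \frac{h}{4}\cdot\frac{1}{m}\sum_{k=1}^{m} \lVert \nabla L_k(\theta) \rVert^2 .
$$

The difference between the two penalties is exactly the point made next: the full-batch gradient can be small even when individual minibatch gradients are large and cancel out.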

And the interesting thing about this line of work is that for GD you get one form, and for SGD you actually get a different form. That's interesting because now you can write down the implicit regularizer that GD imposes and the implicit regularizer that SGD imposes, and because the regularizers are different, you can sort of understand where the difference in behavior is coming from.

So the difference (sorry, I'm kind of skipping over this quickly, but I'm gonna stop after this) is: for GD you just penalize the norm of the full gradient over the entire data set, while for SGD you penalize the gradient norm on each of the mini-batches. And if you think about the parameter space, say you have some directions that cancel out when averaging into the full gradient: then GD is not going to impose flatness in those directions,

while with SGD you're going to impose flatness more widely, all around you. So that's why SGD works better: it finds minima that are flatter in all directions. And the last part of the slide is that, while flatter is better, there are also some open questions. One example: from this hypothesis you would assume that adding noise is useful because it helps you find flatter minima.

It turns out that only certain types of noise are useful; for example, Gaussian noise is not going to help you, the optimization just degrades, which is very frustrating. So there is definitely something about noise, but it's specifically about the kind of noise you get from stochastically sampling data points.

For example, here we tried to use the noise coming from data augmentation, and that hurts: if you get rid of that noise, you actually get better performance. "Does this relate to what you mentioned about deep learning models performing better with natural data?" Yeah, so here we're basically trying to see,

and I mean, this is just one facet of that, whether the noise from standard image data augmentation is as good as the noise coming from sub-sampling the data at forcing you to generalize better, and the answer is no.

It's not just about landing in some flat minimum; the nature of the noise that comes from stochastically sampled data makes things flatter in the right way, and what exactly that means, we don't know. But it's not the case that any noise you add on top of your SGD updates will help you.

Okay, with this, I think I'll have to stop here, because otherwise I'm gonna run over, and I think that's been quite a bit already. Let me see. Yeah, okay, there are quite a few more slides, so we'll catch up on those

tomorrow morning, I think. Thank you, thank you.