Gradient Descent Algorithm
Trainee at AlmaBetter
Disclaimer: This article is written with humorous intent. The Reserve Bank of India is admired globally and is the epitome of excellence in monetary policy and economic decision-making. Indian Police officers, in fact all UPSC personnel, are crème de la crème meritocrats. We do not condone any heinous activities.
The Professor and his team are planning a heist at the RBI because they have learned that India is developing at an incredible rate. The Professor is explaining how to minimize the risk of getting caught by the Indian police using the Gradient Descent Algorithm.
Tokyo: What is this Gradient Descent Algorithm?
Professor: The Gradient Descent Algorithm is used to minimize a function by iteratively adjusting its parameters.
Raquel: Which function does it minimize?
Professor: It minimizes the cost function.
Denver: What is the cost function, and what is the need to minimize it?
Professor: The cost function represents the error between the actual value and the predicted value. Put another way: by minimizing the cost function, we are minimizing the error between the actual value and the predicted value.
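To make the Professor's definition concrete, here is a minimal sketch of one common cost function, the mean squared error. The article does not say which cost function the Professor has in mind, so the choice of mean squared error, the function name, and the sample numbers are all assumptions made for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up sample values, purely for illustration.
actual = [3.0, 5.0, 7.0]
good_guess = [3.0, 5.0, 7.0]   # matches the actual values exactly
poor_guess = [2.0, 6.0, 9.0]   # off by 1, 1 and 2

cost_good = mean_squared_error(actual, good_guess)
cost_poor = mean_squared_error(actual, poor_guess)
print(cost_good, cost_poor)
```

The better the predictions, the smaller the cost, which is why minimizing the cost function is the same as minimizing the error.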
Rio: How will this FUNCTION help us not to get caught?
Professor: By minimizing the risk of getting caught. The Gradient Descent Algorithm minimizes a given function; in our case, that is the risk function. First, it calculates the slope of the function at the current point using the first-order derivative. Second, it moves in the direction opposite to the gradient, by alpha times the gradient at that point, where alpha is the learning rate. (The Professor took the chalk and wrote this on the blackboard.)
New Value = Old Value - Step Size
Professor: In the world of mathematics, the ‘Step Size’ is equal to the product of the Learning Rate and the Slope. So the above equation can be written like this:
New Value = Old Value - (Learning Rate × Slope)
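The blackboard rule can be sketched in a few lines of Python. This is only an illustration: the function y = x * x, the starting point, the learning rate, and the number of steps are assumptions chosen for the example, not values from the Professor's plan.

```python
def gradient_descent(slope, start, learning_rate=0.1, steps=100):
    """Repeatedly apply: new value = old value - learning rate * slope."""
    x = start
    for _ in range(steps):
        step_size = learning_rate * slope(x)  # Step Size = Learning Rate * Slope
        x = x - step_size                     # New Value = Old Value - Step Size
    return x


def slope_of_square(x):
    """First-order derivative of the assumed example function y = x * x."""
    return 2 * x


x_min = gradient_descent(slope_of_square, start=5.0)
print(x_min)  # a value very close to 0, where y = x * x is smallest
```

A learning rate that is too large can make the steps overshoot the minimum, while one that is too small makes progress slow, so in practice it is tuned by trial.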
Tokyo: Professor! In a simple language, please.
Professor: (The Professor looked at Tokyo, nodded slowly, picked up the black projector remote, and presented this slide.) In simple language, suppose this is a function that we want to minimize, and we want to find the value of X at which Y is minimum. For this, we will use the Gradient Descent Algorithm.
Denver (hastily): Zero, zero! The value of X should be zero for Y to be minimum.
Raquel: Why use Gradient Descent when we can clearly see the minimum value?
Professor: In this case, the function has a single variable, so its minimum is apparent. But what if the function has multiple variables? Say, 10 or 50? How would you spot the minimum of that function then? For this very reason, we use the gradient descent algorithm. And by the minimum, I mean a “local minimum”, not necessarily the global minimum, because this algorithm finds a local minimum of the function. We start with some approximate values and then move step by step towards a local minimum. Let’s see this diagram.
Tokyo: Hold on, Professor. Do you want to risk our lives by taking some approximate values?
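The Professor's point about local minima can be seen in a small experiment. The sketch below uses a made-up curve with two valleys (it is not the heist's risk function) and runs gradient descent from two different approximate starting values; the curve, learning rate, and step count are all assumptions chosen for illustration.

```python
def two_valley_curve(x):
    """Hypothetical function with one shallow valley and one deeper valley."""
    return x**4 - 3 * x**2 + x


def two_valley_slope(x):
    """First-order derivative of two_valley_curve."""
    return 4 * x**3 - 6 * x + 1


def descend(slope, start, learning_rate=0.01, steps=1000):
    """Plain gradient descent starting from an approximate value."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * slope(x)
    return x


from_right = descend(two_valley_slope, start=2.0)   # settles in the shallow valley
from_left = descend(two_valley_slope, start=-2.0)   # settles in the deeper valley

print(from_right, two_valley_curve(from_right))
print(from_left, two_valley_curve(from_left))
```

Starting from 2.0, the search stops near x = 1.13, which is only a local minimum; starting from -2.0, it reaches the deeper valley near x = -1.30. The algorithm itself is the same in both runs, so the starting value decides which minimum is found.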
Rio: Tokyo… calm down Tokyo.
Raquel: I am still not satisfied with this algorithm. Ah…Just show me what the graph of risk looks like after we enter the bank.
Rio: Yeah, and also show us the risk function that we have to minimize. With respect to what are we supposed to minimize it? I am just asking: what is X in our risk function?
Professor (calmly stares at Rio, then at Raquel for a few seconds, then back at Rio): Time… We have to minimize the risk function with respect to time. Here’s what the graph of risk versus time looks like. (The Professor took the remote and changed the slide.) We need to escape from the bank at point A, because after that it will become almost impossible to escape.
Denver: Okay. I want to ask why we are focusing so much on the Gradient Descent Algorithm. Can you please give some of the benefits of this algorithm?
Professor: Sure. There are four major benefits of using this algorithm:
It is fast enough to scale to big data.
An exact method would take more time, and in practice we can rarely solve the problem exactly, so we go with an approximation.
The gradient of the loss function tells us which direction to move in to approach the optimal solution.
It is easy to understand.
See this graph of what this algorithm does. (Professor took the remote and showed this slide to everyone)
Professor: It is relatively easy to understand.
Tokyo: Easy! Do you mean that I am a fool and everyone else here is a genius? Come on, Professor. We will just keep on calculating these so-called minimum values of a function, and the Indian police will catch us and maybe even shoot us. The lecture is over, and I am done with this ALGORITHM. (Tokyo angrily left the class.)
Meanwhile, the Indian Government learned of this plan from its intelligence agency and decided to contact AlmaBetter for a better understanding of this algorithm.