January 29, 2014

ml-304: Gradient Descent and Learning Rate

Hi. In the last post, we learnt about the "gradient descent" algorithm, which is used to minimize our cost function "J(Θ0, Θ1)". The gradient descent algorithm was defined as:

repeat until convergence {
  Θj = Θj - α ∂/∂Θj J(Θ0, Θ1) (for j = 0, 1)
}

where,

  • α: is called the "learning rate"
  • ∂/∂Θj J(Θ0, Θ1): is the partial derivative of the cost function (J(Θ0, Θ1)) with respect to Θj
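As a minimal sketch, the update rule above can be written out for our usual linear regression setting, assuming the hypothesis h(x) = Θ0 + Θ1·x and the mean-squared-error cost from the earlier posts (the function and variable names here are illustrative, not from any particular library):

```python
# Sketch of gradient descent for linear regression, assuming
# h(x) = theta0 + theta1 * x and the cost
# J(theta0, theta1) = (1 / (2m)) * sum((h(x_i) - y_i)^2).

def gradient_step(theta0, theta1, xs, ys, alpha):
    m = len(xs)
    # Partial derivatives of J with respect to theta0 and theta1.
    d_theta0 = sum((theta0 + theta1 * x) - y for x, y in zip(xs, ys)) / m
    d_theta1 = sum(((theta0 + theta1 * x) - y) * x for x, y in zip(xs, ys)) / m
    # Simultaneous update: both derivatives use the OLD theta values.
    return theta0 - alpha * d_theta0, theta1 - alpha * d_theta1

# Tiny toy dataset where y = 2x exactly, so the minimum of J is at
# theta0 = 0, theta1 = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
theta0, theta1 = 0.0, 0.0
for _ in range(1000):
    theta0, theta1 = gradient_step(theta0, theta1, xs, ys, alpha=0.1)
# theta0, theta1 are now close to 0 and 2 respectively.
```

Note that both parameters are updated simultaneously: the derivatives are computed from the old values of Θ0 and Θ1 before either is changed.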

Today we will try to understand the working of the gradient descent algorithm itself. We will also learn about the role of the learning rate "α". Let us start by understanding a single step of the algorithm, viz.

Θj = Θj - α ∂/∂Θj J(Θ0, Θ1) (for j = 0, 1)

As we have mentioned above, the term "∂/∂Θj J(Θ0, Θ1)" is the partial derivative of the cost function (J(Θ0, Θ1)) with respect to Θj. The partial derivative of a function gives the "slope" of that function at a given point. That is, in this case, this term gives the slope of our cost function at the current value of Θj.

![Slope of the cost function J(Θ0, Θ1) at a point]()

We also know, from this post, that the parameter Θ1 defines the slope of our hypothesis function "h(x)". So in a single step of the gradient descent algorithm, we are changing each parameter Θj by a small amount (given by α times the slope of J with respect to that parameter).

![A single gradient descent step adjusting the hypothesis parameters]()
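To see why subtracting α times the slope moves a parameter toward the minimum, consider a one-parameter toy cost J(Θ) = Θ² (a stand-in, not our actual cost function). Its derivative is 2Θ, so the step decreases a positive Θ and increases a negative Θ; either way, Θ moves toward the minimum at 0:

```python
# Toy illustration: for J(theta) = theta**2, the derivative is 2 * theta.
# Subtracting alpha * slope moves theta toward the minimum at 0 from
# either side.

def step(theta, alpha=0.1):
    slope = 2 * theta  # dJ/dtheta for J(theta) = theta**2
    return theta - alpha * slope

print(step(3.0))   # positive slope, so theta decreases toward 0
print(step(-3.0))  # negative slope, so theta increases toward 0
```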

How "small" the change is, is determined by the learning rate "α". If "α" is too small, the steps will be tiny and our gradient descent implementation will take a long time to converge (see the green points in the graph below). Whereas if "α" is too large, gradient descent will take very big steps; it can overshoot the minimum and even diverge. But if "α" is just right (see the red points in the graph), the implementation will take reasonable steps and converge quickly to the local minimum.

![Gradient descent steps for a too-small (green) and a well-chosen (red) learning rate]()
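The three behaviours can be sketched on the same toy cost J(Θ) = Θ² (an assumption for illustration, where dJ/dΘ = 2Θ, so each step multiplies Θ by (1 - 2α)):

```python
# Sketch: effect of the learning rate alpha on the toy cost
# J(theta) = theta**2. Each step is theta -= alpha * 2 * theta,
# i.e. theta is multiplied by (1 - 2 * alpha) every iteration.

def run(alpha, theta=1.0, steps=50):
    for _ in range(steps):
        theta = theta - alpha * (2 * theta)
    return theta

print(abs(run(0.01)))  # too small: still far from 0 after 50 steps
print(abs(run(0.4)))   # just right: essentially 0
print(abs(run(1.1)))   # too large: each step overshoots, so it diverges
```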

As such, the gradient descent algorithm can be used to find any minimum (local or global). But because we have defined our cost function in such a way that it is convex and has a single global minimum, our gradient descent implementation will always find the global minimum!
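A quick way to see what convexity buys us is to run gradient descent from very different starting points on a convex toy cost, here J(Θ) = (Θ - 2)² (again an illustrative stand-in, not our actual J(Θ0, Θ1)), and observe that both runs land at the same global minimum:

```python
# Sketch: on a convex cost, gradient descent reaches the same global
# minimum regardless of where it starts. Toy cost J(theta) = (theta - 2)**2,
# with derivative dJ/dtheta = 2 * (theta - 2) and minimum at theta = 2.

def minimize(theta, alpha=0.1, steps=200):
    for _ in range(steps):
        theta = theta - alpha * 2 * (theta - 2)
    return theta

print(minimize(-10.0))  # starts far to the left, ends near 2
print(minimize(10.0))   # starts far to the right, also ends near 2
```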