ml-303: Gradient Descent
Hi, welcome back! I hope you have been following our machine learning algorithm series over the last few posts. If you have, then you know that so far we have developed the basic framework for an algorithm called "univariate linear regression", and the only step pending is to minimize our cost function "J(Θ0, Θ1)" so that we can find the optimal values of Θ0 and Θ1. Once we have those values, we can substitute them to arrive at the definition of our hypothesis function "h(x)", which will let us predict values for our target variable "y".
I hope that things are clear so far. If they are, then let's move on to the final step of the algorithm, which is to minimize our cost function "J(Θ0, Θ1)". The general outline of the steps that we will follow to achieve this is:
- start with some values of Θ0 and Θ1 (a common choice is "0, 0")
- keep changing Θ0, Θ1 to reduce J(Θ0, Θ1) until we hopefully end up at a minimum
There are various minimization algorithms available, but the most commonly used (and the one that we will study now) is the "gradient descent" algorithm. It is defined as follows:
repeat until convergence { Θj = Θj - α ∂/∂Θj J(Θ0, Θ1) (for j = 0, 1) }
where,
- α: is called the "learning rate"
- ∂/∂Θj J(Θ0, Θ1): is the partial derivative of the cost function (J(Θ0, Θ1)) with respect to Θj
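To make this concrete, here is a minimal sketch in Python of the whole loop. It uses a made-up two-parameter cost function and approximates the partial derivatives numerically, so that we don't need the actual derivative of our cost function yet; all of the names in it (`cost`, `partial`, and so on) are placeholders for illustration, not something we have defined in this series.

```python
# A minimal sketch of the gradient descent loop, using a made-up
# two-parameter cost function and a numerical approximation of the
# partial derivatives. All names here are placeholders for illustration.

def cost(theta0, theta1):
    # Stand-in for J(Θ0, Θ1); any smooth function of two parameters works.
    return (theta0 - 3.0) ** 2 + (theta1 + 1.0) ** 2

def partial(j_func, theta0, theta1, j, eps=1e-6):
    # Finite-difference approximation of ∂/∂Θj J(Θ0, Θ1)
    if j == 0:
        return (j_func(theta0 + eps, theta1) - j_func(theta0 - eps, theta1)) / (2 * eps)
    return (j_func(theta0, theta1 + eps) - j_func(theta0, theta1 - eps)) / (2 * eps)

alpha = 0.1                  # learning rate α
theta0, theta1 = 0.0, 0.0    # the common starting choice "0, 0"

# Repeat Θj = Θj - α ∂/∂Θj J(Θ0, Θ1) (for j = 0, 1) until convergence,
# i.e. until the cost barely changes between iterations.
previous = cost(theta0, theta1)
while True:
    grad0 = partial(cost, theta0, theta1, j=0)
    grad1 = partial(cost, theta0, theta1, j=1)
    theta0 = theta0 - alpha * grad0
    theta1 = theta1 - alpha * grad1
    current = cost(theta0, theta1)
    if abs(previous - current) < 1e-9:
        break
    previous = current

print(theta0, theta1)  # should end up close to the minimum at (3, -1)
```

Notice that the loop keeps updating Θ0 and Θ1 until the cost barely changes between iterations, which is one simple way of deciding that we have "converged".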
Note that in any one step of the above algorithm, all of the Θ's should be updated simultaneously. So,
temp0 = Θ0 - α ∂/∂Θ0 J(Θ0, Θ1)
temp1 = Θ1 - α ∂/∂Θ1 J(Θ0, Θ1)
Θ0 = temp0
Θ1 = temp1
is correct, but the following is not
temp0 = Θ0 - α ∂/∂Θ0 J(Θ0, Θ1)
Θ0 = temp0
temp1 = Θ1 - α ∂/∂Θ1 J(Θ0, Θ1)
Θ1 = temp1
That is because in the second sequence, when we calculate temp1, we use the new value of Θ0 instead of the old one. This simultaneous update is a very important detail, and missing it can lead to difficult-to-find errors, so be careful when implementing this step.
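To see the difference in code, here is a small Python sketch that contrasts the two orderings on a toy cost function (again just a placeholder, not our "J(Θ0, Θ1)"). With the simultaneous update both parameters are moved using the old values; the sequential version evaluates the Θ1 derivative at the already-updated Θ0 and ends up at a different point after just one step.

```python
# A small sketch contrasting simultaneous and sequential updates on a
# toy cost function (a placeholder, not the J(Θ0, Θ1) from our series).

def J(t0, t1):
    return (t0 + t1 - 2.0) ** 2

def dJ_dt0(t0, t1):
    # ∂J/∂Θ0 for the toy cost
    return 2.0 * (t0 + t1 - 2.0)

def dJ_dt1(t0, t1):
    # ∂J/∂Θ1 for the toy cost
    return 2.0 * (t0 + t1 - 2.0)

alpha = 0.25

# Correct: both temps are computed from the old values, then assigned.
t0, t1 = 0.0, 0.0
temp0 = t0 - alpha * dJ_dt0(t0, t1)
temp1 = t1 - alpha * dJ_dt1(t0, t1)
t0, t1 = temp0, temp1
print("simultaneous:", t0, t1)   # -> 1.0 1.0

# Incorrect: Θ0 is overwritten before the Θ1 derivative is evaluated,
# so the second derivative uses the new Θ0 instead of the old one.
t0, t1 = 0.0, 0.0
t0 = t0 - alpha * dJ_dt0(t0, t1)
t1 = t1 - alpha * dJ_dt1(t0, t1)
print("sequential:  ", t0, t1)   # -> 1.0 0.5
```

A one-step discrepancy like this is exactly the kind of bug that is hard to notice and harder to track down, which is why the simultaneous update deserves so much emphasis.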
That's it for today. In the next post we will dig deeper to understand how gradient descent finds the minimum value of the cost function (J(Θ0, Θ1)) and, in turn, helps us find the optimal values of Θ0 and Θ1. We will also see what the learning rate "α" is and how its value impacts the performance of our gradient descent implementation.