ml-303: Gradient Descent
Hi, welcome back! I hope you have been following our machine learning algorithm series over the last few posts. If you have, then you know that so far we have developed the basic framework for an algorithm called "univariate linear regression", and the only step pending is to minimize our cost function "J(Θ0, Θ1)" so that we can find the optimal values of Θ0 and Θ1. Once we have those values, we can substitute them to arrive at the definition of our hypothesis function "h(x)", which will let us predict values for our target variable "y".
I hope that things are clear so far. If they are, then let's move on to the final step of the algorithm, which is to minimize our cost function "J(Θ0, Θ1)". The general outline of the steps that we will follow to achieve this is:
- start with some values of Θ0 and Θ1 (a common choice is "0, 0")
- keep changing Θ0, Θ1 to reduce J(Θ0, Θ1) until we hopefully end up at a minimum
There are various minimization algorithms available, but the most commonly used (and the one that we will study now) is the "gradient descent" algorithm. It is defined as follows:
repeat until convergence { Θj = Θj - α ∂/∂Θj J(Θ0, Θ1) (for j = 0, 1) }
where,
- α: is called the "learning rate"
- ∂/∂Θj J(Θ0, Θ1): is the partial derivative of the cost function (J(Θ0, Θ1)) with respect to Θj
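To make this concrete, here is a minimal sketch in Python of the whole loop. It uses a made-up two-parameter cost function and approximates the partial derivatives numerically, so that we don't need the actual derivative of our cost function yet; all of the names in it (`cost`, `partial`, and so on) are placeholders for illustration, not something we have defined in this series.

```python
# A minimal sketch of the gradient descent loop, using a made-up
# two-parameter cost function and a numerical approximation of the
# partial derivatives. All names here are placeholders for illustration.

def cost(theta0, theta1):
    # Stand-in for J(Θ0, Θ1); any smooth function of two parameters works.
    return (theta0 - 3.0) ** 2 + (theta1 + 1.0) ** 2

def partial(j_func, theta0, theta1, j, eps=1e-6):
    # Finite-difference approximation of ∂/∂Θj J(Θ0, Θ1)
    if j == 0:
        return (j_func(theta0 + eps, theta1) - j_func(theta0 - eps, theta1)) / (2 * eps)
    return (j_func(theta0, theta1 + eps) - j_func(theta0, theta1 - eps)) / (2 * eps)

alpha = 0.1                  # learning rate α
theta0, theta1 = 0.0, 0.0    # the common starting choice "0, 0"

# Repeat Θj = Θj - α ∂/∂Θj J(Θ0, Θ1) (for j = 0, 1) until convergence,
# i.e. until the cost barely changes between iterations.
previous = cost(theta0, theta1)
while True:
    grad0 = partial(cost, theta0, theta1, j=0)
    grad1 = partial(cost, theta0, theta1, j=1)
    theta0 = theta0 - alpha * grad0
    theta1 = theta1 - alpha * grad1
    current = cost(theta0, theta1)
    if abs(previous - current) < 1e-9:
        break
    previous = current

print(theta0, theta1)  # should end up close to the minimum at (3, -1)
```

Notice that the loop keeps updating Θ0 and Θ1 until the cost barely changes between iterations, which is one simple way of deciding that we have "converged".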
Note that in any one step of the above algorithm, all of the Θ's should be updated simultaneously. So,
temp0 = Θ0 - α ∂/∂Θ0 J(Θ0, Θ1)
temp1 = Θ1 - α ∂/∂Θ1 J(Θ0, Θ1)
Θ0 = temp0
Θ1 = temp1
is correct, but the following is not
temp0 = Θ0 - α ∂/∂Θ0 J(Θ0, Θ1)
Θ0 = temp0
temp1 = Θ1 - α ∂/∂Θ1 J(Θ0, Θ1)
Θ1 = temp1
That is because in the second sequence, when we calculate temp1, we use the new value of Θ0 instead of the old one. This simultaneous update is a very important detail, and missing it can lead to difficult-to-find errors, so be careful when implementing this step.
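To see the difference in code, here is a small Python sketch that contrasts the two orderings on a toy cost function (again just a placeholder, not our "J(Θ0, Θ1)"). With the simultaneous update both parameters are moved using the old values; the sequential version evaluates the Θ1 derivative at the already-updated Θ0 and ends up at a different point after just one step.

```python
# A small sketch contrasting simultaneous and sequential updates on a
# toy cost function (a placeholder, not the J(Θ0, Θ1) from our series).

def J(t0, t1):
    return (t0 + t1 - 2.0) ** 2

def dJ_dt0(t0, t1):
    # ∂J/∂Θ0 for the toy cost
    return 2.0 * (t0 + t1 - 2.0)

def dJ_dt1(t0, t1):
    # ∂J/∂Θ1 for the toy cost
    return 2.0 * (t0 + t1 - 2.0)

alpha = 0.25

# Correct: both temps are computed from the old values, then assigned.
t0, t1 = 0.0, 0.0
temp0 = t0 - alpha * dJ_dt0(t0, t1)
temp1 = t1 - alpha * dJ_dt1(t0, t1)
t0, t1 = temp0, temp1
print("simultaneous:", t0, t1)   # -> 1.0 1.0

# Incorrect: Θ0 is overwritten before the Θ1 derivative is evaluated,
# so the second derivative uses the new Θ0 instead of the old one.
t0, t1 = 0.0, 0.0
t0 = t0 - alpha * dJ_dt0(t0, t1)
t1 = t1 - alpha * dJ_dt1(t0, t1)
print("sequential:  ", t0, t1)   # -> 1.0 0.5
```

A one-step discrepancy like this is exactly the kind of bug that is hard to notice and harder to track down, which is why the simultaneous update deserves so much emphasis.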
That's it for today. In the next post we will dig deeper to understand how gradient descent finds the minimum value of the cost function (J(Θ0, Θ1)) and, in turn, helps us find the optimal values of Θ0 and Θ1. We will also see what the learning rate "α" is and how its value impacts the performance of our gradient descent implementation.