February 6, 2014

ml-405: Normal Equation -- part - 1

Hello! Today we will learn an amazing ML algorithm that allows us to find the hypothesis function for a regression problem in just one step. Sounds exciting? So let's move forward with full force!

As we already know from here, the steps to solve a regression problem are:

The hypothesis function is defined as:

h_Θ(x) = Σ_{j=0}^{n} Θ_j x_j

The cost function is defined as:

J(Θ_0, Θ_1, Θ_2, …, Θ_n) = (1/(2m)) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i))^2

While the gradient descent steps are:

repeat until convergence {

Θ_j = Θ_j - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)) * x_j^(i),   for j = 0 to n (all Θ_j updated simultaneously)

}

Notice that the last step, viz. "gradient descent", is a loop; that is, it takes an "iterative" approach to minimizing the cost function and thus finding the optimal values of Θ (that is, Θ_0, Θ_1, …, Θ_n).
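To make this concrete, here is a minimal sketch of the hypothesis, the cost function and the gradient descent loop in Python with numpy (the learning rate of 0.01, the fixed count of 1500 iterations and the use of numpy itself are illustrative choices, not part of this post); X holds one training example per row, with a leading column of ones for x_0:

import numpy as np

def hypothesis(X, theta):
    # h_Θ(x) = Σ_{j=0}^{n} Θ_j x_j, computed for every training example at once.
    # X has shape (m, n+1), with the first column all ones (x_0 = 1).
    return X @ theta

def cost(X, y, theta):
    # J(Θ) = (1/(2m)) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i))^2
    m = len(y)
    errors = hypothesis(X, theta) - y
    return (errors @ errors) / (2 * m)

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    # Repeatedly apply the simultaneous update
    # Θ_j = Θ_j - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)) * x_j^(i)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = hypothesis(X, theta) - y      # shape (m,)
        gradient = (X.T @ errors) / m          # shape (n+1,)
        theta = theta - alpha * gradient       # update all Θ_j together
    return theta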

Wouldn't it be great if we could skip this whole iteration step and find the optimal values of Θ directly? Well, it is certainly possible, and that is exactly what we will learn today. This method of mathematically finding the optimal values of Θ directly (in just one step) is known as the "normal equation" method.

Let us try to understand how it works by first solving a simple problem and then moving on to the more general, but harder, problem. So for now, let us assume that Θ is a scalar (a real number) instead of a vector (Θ_0, Θ_1, …, Θ_n). Our cost function is then defined as:

j(Θ) = a Θ^2 + b Θ + c

The mathematical way to find the optimal value of Θ that minimizes the above cost function is to take the derivative of the cost function with respect to Θ and set it to 0, as follows:

2 a Θ + b = 0;

This is an equation in a single variable (Θ) and can be easily solved: Θ = -b / (2 a). The resulting value of Θ is the optimal value that we are looking for! Ain't that cool?
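As a quick sanity check, here is a tiny Python snippet (the concrete values a = 2, b = -8, c = 3 are made up purely for illustration) that compares this closed-form answer Θ = -b / (2a) against a brute-force grid search over Θ:

import numpy as np

# Made-up coefficients for j(Θ) = a Θ^2 + b Θ + c.
a, b, c = 2.0, -8.0, 3.0

def j(theta):
    return a * theta**2 + b * theta + c

# Closed form: set the derivative 2 a Θ + b to 0 and solve.
theta_closed_form = -b / (2 * a)

# Brute force: evaluate j(Θ) on a dense grid and keep the smallest value.
grid = np.linspace(-10, 10, 100001)
theta_brute_force = grid[np.argmin(j(grid))]

print(theta_closed_form)   # 2.0
print(theta_brute_force)   # 2.0 -- agrees with the closed form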

To generalize, let's look at our original cost function again:

J(Θ_0, Θ_1, Θ_2, …, Θ_n) = (1/(2m)) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i))^2

According to calculus theory, if we take the "partial derivatives" of the above cost function with respect to each of the Θs (Θ_0, Θ_1, …, Θ_n), set them to 0 (zero) and solve, we will get the optimal values for each of our Θ_j (j = 0, 1, …, n)! These values can then be easily substituted into the hypothesis function and used to predict values for "y".
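For linear regression, carrying out exactly this "set the partial derivatives to 0 and solve" procedure yields a closed-form answer, usually written as Θ = (X^T X)^(-1) X^T y. Here is a minimal one-step sketch in Python with numpy (the octave version comes in the next post), assuming the design matrix X already carries a leading column of ones for x_0; the small data set is made up purely for illustration:

import numpy as np

def normal_equation(X, y):
    # One-step solution: Θ = (X^T X)^(-1) X^T y.
    # np.linalg.solve is used instead of an explicit matrix inverse,
    # which is the numerically safer way to solve (X^T X) Θ = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny made-up data set: y ≈ 1 + 2x, so we expect Θ ≈ [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # prepend x_0 = 1
y = np.array([1.0, 3.1, 4.9, 7.0])

theta = normal_equation(X, y)
print(theta)   # roughly [1.0, 2.0] -- no iterative loop required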

Ain't this cool? If it is, keep watching this space for the next post, where we will look at the implementation of this method in octave, and also at the advantages and disadvantages this method has when compared with the gradient descent algorithm.