February 6, 2014

ml-405: Normal Equation -- part - 1

Hello! Today we will learn an amazing ML algorithm that allows us to find the hypothesis function for a regression problem in just one step. Sounds exciting? So let's move forward with full force!

As we already know from here, the steps to solve a regression problem are:

The hypothesis function is defined as:

h_Θ(x) = Σ_{j=0}^{n} Θ_j x_j

The cost function is defined as:

J(Θ_0, Θ_1, Θ_2, …, Θ_n) = (1/(2m)) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i))^2

While the gradient descent steps are:

repeat until convergence {

Θ_j = Θ_j - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)) * x_j^(i),   for j = 0 to n (all Θ_j updated simultaneously)

}

Notice that the last step, viz. "gradient descent", is a loop; that is, it takes an "iterative" approach to minimizing the cost function and thus finding the optimal values of Θ (that is, Θ_0, Θ_1, …, Θ_n).
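To make this concrete, here is a minimal sketch of the hypothesis, the cost function and the gradient descent loop in Python with numpy (the learning rate of 0.01, the fixed count of 1500 iterations and the use of numpy itself are illustrative choices, not part of this post); X holds one training example per row, with a leading column of ones for x_0:

import numpy as np

def hypothesis(X, theta):
    # h_Θ(x) = Σ_{j=0}^{n} Θ_j x_j, computed for every training example at once.
    # X has shape (m, n+1), with the first column all ones (x_0 = 1).
    return X @ theta

def cost(X, y, theta):
    # J(Θ) = (1/(2m)) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i))^2
    m = len(y)
    errors = hypothesis(X, theta) - y
    return (errors @ errors) / (2 * m)

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    # Repeatedly apply the simultaneous update
    # Θ_j = Θ_j - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)) * x_j^(i)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = hypothesis(X, theta) - y      # shape (m,)
        gradient = (X.T @ errors) / m          # shape (n+1,)
        theta = theta - alpha * gradient       # update all Θ_j together
    return theta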

Wouldn't it be great if we could skip this whole iteration step and find the optimal values of Θ directly? Well, it is certainly possible, and that is exactly what we will learn today. This method of mathematically finding the optimal values of Θ directly (in just one step) is known as the "normal equation" method.

Let us try to understand how it works by first solving a simple problem and then moving on to the more general, but harder, problem. So for now, let us assume that Θ is a scalar (a real number) instead of a vector (Θ_0, Θ_1, …, Θ_n). Our cost function is then defined as:

j(Θ) = a Θ^2 + b Θ + c

The mathematical way to find the optimal value of Θ that minimizes the above cost function is to take the derivative of the cost function with respect to Θ and set it to 0, as follows:

2 a Θ + b = 0;

This is an equation in a single variable (Θ) and can be easily solved: Θ = -b / (2 a). The resulting value of Θ is the optimal value that we are looking for! Ain't that cool?
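As a quick sanity check, here is a tiny Python snippet (the concrete values a = 2, b = -8, c = 3 are made up purely for illustration) that compares this closed-form answer Θ = -b / (2a) against a brute-force grid search over Θ:

import numpy as np

# Made-up coefficients for j(Θ) = a Θ^2 + b Θ + c.
a, b, c = 2.0, -8.0, 3.0

def j(theta):
    return a * theta**2 + b * theta + c

# Closed form: set the derivative 2 a Θ + b to 0 and solve.
theta_closed_form = -b / (2 * a)

# Brute force: evaluate j(Θ) on a dense grid and keep the smallest value.
grid = np.linspace(-10, 10, 100001)
theta_brute_force = grid[np.argmin(j(grid))]

print(theta_closed_form)   # 2.0
print(theta_brute_force)   # 2.0 -- agrees with the closed form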

To generalize, let's look at our original cost function again:

J(Θ_0, Θ_1, Θ_2, …, Θ_n) = (1/(2m)) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i))^2

According to calculus theory, if we take the "partial derivatives" of the above cost function with respect to each of the Θs (Θ_0, Θ_1, …, Θ_n), set them to 0 (zero) and solve, we will get the optimal values for each of our Θ_j (j = 0, 1, …, n)! These values can then be easily substituted into the hypothesis function and used to predict values for "y".
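For linear regression, carrying out exactly this "set the partial derivatives to 0 and solve" procedure yields a closed-form answer, usually written as Θ = (X^T X)^(-1) X^T y. Here is a minimal one-step sketch in Python with numpy (the octave version comes in the next post), assuming the design matrix X already carries a leading column of ones for x_0; the small data set is made up purely for illustration:

import numpy as np

def normal_equation(X, y):
    # One-step solution: Θ = (X^T X)^(-1) X^T y.
    # np.linalg.solve is used instead of an explicit matrix inverse,
    # which is the numerically safer way to solve (X^T X) Θ = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny made-up data set: y ≈ 1 + 2x, so we expect Θ ≈ [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # prepend x_0 = 1
y = np.array([1.0, 3.1, 4.9, 7.0])

theta = normal_equation(X, y)
print(theta)   # roughly [1.0, 2.0] -- no iterative loop required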

Ain't this cool? If it is, keep watching this space for the next post, where we will look at the implementation of this method in octave, and also at the advantages and disadvantages this method has when compared with the gradient descent algorithm.