ml-406: Normal Equation -- part - 2
Hello there! Welcome back. In the last post we took a look at another algorithm, viz the "normal equation", to solve the linear regression ML problem. However, that was perhaps a bit too theoretical in nature. Today we will take a concrete example and see how to implement the solution in octave. Are you ready? If so, let's move on…
The concrete example for today is as follows (including the pivot feature x0):
| pivot | size (feet²) | #bedrooms | #floors | age (years) | price ($1000) |
|-------|--------------|-----------|---------|-------------|---------------|
| x0    | x1           | x2        | x3      | x4          | y             |
| 1     | 2104         | 5         | 1       | 45          | 460           |
| 1     | 1416         | 3         | 2       | 40          | 232           |
| 1     | 1534         | 3         | 2       | 30          | 315           |
| 1     | 852          | 2         | 1       | 36          | 178           |
In matrix form, this gives rise to the following X (of size m x (n+1), here 4 x 5):
1 | 2104 | 5 | 1 | 45 |
1 | 1416 | 3 | 2 | 40 |
1 | 1534 | 3 | 2 | 30 |
1 | 852 | 2 | 1 | 36 |
and y (of size m x 1, here 4 x 1):
460 |
232 |
315 |
178 |
If that is the case, then by applying matrix algebra, theta can be solved directly as:
theta = (X^T X)^-1 X^T y
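(As a quick recap of the last post: this formula comes from setting the derivative of the cost function to zero, which gives the normal equation X^T X * theta = X^T y, and then multiplying both sides by the inverse of X^T X.)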
In octave, the command to solve the above is
theta = pinv(X' * X) * X' * y;
That's it! Ain't it cool how such a complex concept can be expressed in just 1 line of code in a high-level language like octave? Hence, as I had mentioned here, it is extremely important (and useful) to first write solutions for any ML problem in a high-level language (like octave, matlab, python, etc.), and only when we are sure that we have the right solution, to port them to other, faster languages if need be.
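To make this fully concrete, here is a minimal, self-contained octave sketch that puts the data from the table above together with the one-liner (the variable names X, y and theta are simply the ones used in this post):

```octave
% design matrix X: pivot feature x0 (all 1s) plus the 4 features from the table
X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36];

% target vector y: the prices (in $1000s)
y = [460; 232; 315; 178];

% the normal equation: theta = (X^T X)^-1 X^T y
theta = pinv(X' * X) * X' * y

% predictions for the training data (should come out close to y)
predictions = X * theta
```

Note that the one-liner uses pinv (the pseudo-inverse) rather than inv; this keeps it working even in the non-invertible cases discussed towards the end of this post.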
Coming back to our problem at hand. Aren't you wondering: if it is so simple to solve for the values of theta, why did we learn the "gradient descent" algorithm in the first place? Well, obviously, because there are scenarios where the normal equation method does not work. To understand this, let's compare the two methods, viz gradient descent and normal equation, and see their relative advantages and disadvantages.
Let's look at the advantages of the normal equation method over gradient descent:
- there is no need to do feature scaling
- there is no need to choose the learning rate "α"
- there is no need to iterate
Now lets look at scenarios where normal equation method does not work:
- when the number of features (n) is very large, the cost of computing the inverse of the n x n matrix (X^T X) becomes extremely high (of the order of O(n^3)). On modern computers, for n < 10000 the implementation will work (though speed decreases as n increases), but for values of n above 10000 the processing might never complete.
Although, at this stage of our study, an n of the order of 10000 looks extremely big, do note that:
- in real-world problems, it is possible to have this many (and more) features
- when you consider the "polynomial regression" algorithm from this post, one can easily see how n can grow very fast (see the sketch just after this list)
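As a quick illustration of that second point, here is a small octave sketch (the degree 5 and the variable names are just for illustration) that expands a single feature, house size, into polynomial terms; every extra degree adds one more column to the design matrix:

```octave
% one original feature: house size
sizes = [2104; 1416; 1534; 852];

% expanding it into polynomial terms up to degree 5 turns 1 feature into 5
d = 5;
Xp = zeros(length(sizes), d);
for k = 1:d
  Xp(:, k) = sizes .^ k;   % column k holds size^k
end
```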
Finally, there is 1 more special case where the normal equation algorithm might get stuck. This is when the matrix (X^T X) is non-invertible, that is, it does not have an inverse. This should happen in only the following 2 cases:
- some features x_i and x_j are linearly dependent on each other (for example, x1 = size in feet, and x2 = size in meters). in such a case we need to identify such features and remove one of them (see the sketch just after this list)
- #training-examples (m) <= #features (n). that is, there is not enough data to fit all features and find the optimal theta. in this case we need to
- either reduce features
- or add more data
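The first case (linearly dependent features) is easy to reproduce in octave; the sketch below (the conversion factor and variable names are just for illustration) shows that (X^T X) becomes rank-deficient, and also why pinv, rather than inv, was used in the one-liner above:

```octave
size_feet   = [2104; 1416; 1534; 852];
size_meters = size_feet * 0.0929;            % x2 is just a rescaled copy of x1

X = [ones(4, 1), size_feet, size_meters];    % pivot feature + 2 dependent features
A = X' * X;

rank(A)                                      % prints 2 (not 3) => A is singular
theta = pinv(A) * X' * [460; 232; 315; 178]  % pinv still returns a usable theta
```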
Both the above situations are easily avoidable, which leaves us with the only real limitation, namely n staying below roughly 10000, thus making the normal equation algorithm a very attractive choice while solving linear regression ML problems.
This also brings us to the end of the 40x series. In the next post we will start looking at other ML problems and algorithms to solve them. So don't go away. Keep watching this space :)