February 5, 2014

ml-404: Feature Choice and Polynomial Regression

Hi. It's good to see you once more. Today we will learn a technique for choosing features that can improve the performance (in terms of prediction accuracy) of a regression algorithm, and then look at the type of regression that results from it.

In one of the initial posts, when we started learning the linear regression ML algorithm, we used the following data

| size in ft² (x) | price in $1000's (y) |
|-----------------|----------------------|
| 2104            | 460                  |
| 1416            | 232                  |
| 1534            | 315                  |
| 852             | 178                  |

and called our problem the "housing prediction" problem. This particular problem is the "univariate" type of linear regression because we have only 1 input feature, viz., the size of the house in sq-ft.
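As a quick refresher, here is a minimal sketch of fitting the univariate hypothesis h(x) = θ0 + θ1·x to the table above (using NumPy, which is an assumed tool here, not one used in the original posts):

```python
import numpy as np

# the housing data from the table above
x = np.array([2104, 1416, 1534, 852])  # size in sq-ft
y = np.array([460, 232, 315, 178])     # price in $1000's

# least-squares fit of h(x) = theta0 + theta1 * x
theta1, theta0 = np.polyfit(x, y, deg=1)
print(f"h(x) = {theta0:.2f} + {theta1:.4f} * x")
print("predicted price for 1500 sq-ft:", theta0 + theta1 * 1500)
```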

However, it is easy to imagine that we might have received our input features as "length" and "depth" instead, i.e. as x1 and x2. But perhaps we had some background domain knowledge telling us that the price of a house depends more on its area (length * depth) than on the individual features of length and depth themselves. In that case, it's as if we created a new feature x3

x3 = x1 * x2;

and used it in place of x1 and x2. Perhaps we also knew, again from domain knowledge, that given the area (x3), the features of length (x1) and depth (x2) do not play any significant role in predicting the price (y), and hence can be dropped, making y depend only on x3.
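As a minimal sketch of this feature construction (the length and depth values below are hypothetical, since our data only records the total size, and NumPy is again an assumed tool):

```python
import numpy as np

# hypothetical raw features: length and depth of each house plot, in ft
x1 = np.array([46.0, 36.0, 38.0, 28.4])   # length
x2 = np.array([45.7, 39.3, 40.4, 30.0])   # depth

# derived feature: area = length * depth
x3 = x1 * x2

# y is now modeled as a function of x3 alone; x1 and x2 are dropped
y = np.array([460, 232, 315, 178])         # price in $1000's
theta1, theta0 = np.polyfit(x3, y, deg=1)  # fit h(x3) = theta0 + theta1 * x3
print(f"h(x3) = {theta0:.2f} + {theta1:.4f} * x3")
```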

The point to note here is that x3 is no longer a simple variable/feature; it is the product of 2 simpler features (x1 and x2). Thus, while the graph of x1 versus y might have been linear, the graph of x3 versus y could appear to have a quadratic shape (some sort of curved nature).

A similar effect (in terms of the graph plot) appears if we use a polynomial power of the base variable itself. For example,

y = x1 ^ 2;

or

y = x1 ^ 3;

would each give the curve of the input versus output variable a different shape.

(Plots: the input versus output variable for different polynomial powers of the feature.)
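Since the original plot images are missing, here is a quick sketch (with matplotlib as an assumed tool) that reproduces their gist, i.e. how raising a feature to higher powers bends the curve:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)

# the same input raised to different powers gives differently shaped curves
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for ax, power in zip(axes, [1, 2, 3]):
    ax.plot(x, x**power)
    ax.set_title(f"y = x^{power}")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
plt.tight_layout()
plt.show()
```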

Looking at the plots of your input versus output variable, you might get an idea that a polynomial feature is needed. In such cases, armed with some domain knowledge, if one plays with the choice of powers of the input feature variables, it is possible to radically improve the performance (in terms of prediction accuracy, not speed) of our implementation.

Since, in these cases, we use polynomial powers of our input variable(s), these algorithms are also known as "polynomial regression" ML algorithms. Of course, if one has domain knowledge and already knows some relationship (even an approximate one) between the input and output variables, it can be extremely helpful.
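To make this concrete, here is a minimal sketch of polynomial regression on the housing data (again with NumPy as an assumed tool): we feed powers of the single input feature into an otherwise ordinary least-squares fit. Scaling the feature first keeps its higher powers in a sane numeric range.

```python
import numpy as np

x = np.array([2104, 1416, 1534, 852], dtype=float)  # size in sq-ft
y = np.array([460, 232, 315, 178], dtype=float)     # price in $1000's

# scale the feature so that x^2 does not dwarf x numerically
x_scaled = x / x.max()

# design matrix with columns [1, x, x^2]: polynomial regression is just
# linear regression on these derived (polynomial) features
X = np.column_stack([np.ones_like(x_scaled), x_scaled, x_scaled**2])

# ordinary least-squares solution for theta
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("theta:", theta)
print("fitted prices:", X @ theta)
```

Note that the quadratic term here is a choice, not a given; whether it actually helps depends on the shape of your data.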

In future posts we will take a look at some advanced algorithms that can find such relationships automatically, and at other algorithms that automatically keep only those features which significantly improve the performance of the implementation and drop the rest. So keep watching this space to learn about them :)