February 4, 2014

ml-403: Feature Scaling

Hello. Welcome back! I hope that you are enjoying this series on machine learning, especially the last two posts here and here, where we implemented the "multivariate linear regression" ML algorithm and then improved its performance. In this post we will look at one more technique that should be applied to the multivariate version of linear regression to improve its performance, viz. "feature scaling". The reason I say that this technique is needed for the "multivariate" version in particular is that it solves a problem from which the "univariate" version does not suffer: features with very different ranges.

Let's take a concrete example to shed more light on what I am trying to say. For this, consider this data, a small snippet of which is shown below:

18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
14.0 8 454.0 220.0 4354. 9.0 70 1 "chevrolet impala"
14.0 8 440.0 215.0 4312. 8.5 70 1 "plymouth fury iii"
14.0 8 455.0 225.0 4425. 10.0 70 1 "pontiac catalina"
15.0 8 390.0 190.0 3850. 8.5 70 1 "amc ambassador dpl"

If you remember, we have seen this dataset earlier here. Let's ignore the first column for now, because that is the target variable; that leaves us with 8 features. Looking at their values (from the snippet above and, where noted, the complete dataset), we notice that (a quick way to verify these ranges yourself is shown right after the list):

  • 1st feature ranges between 4 and 8 (in the complete dataset)
  • 2nd feature ranges between 300 and 500 (approx)
  • 3rd feature ranges between 100 and 250 (approx)
  • 4th feature ranges between 3400 and 4500 (approx)
  • 5th feature ranges between 8 and 12 (approx)
  • 6th feature ranges between 70 and 85 (in the complete dataset)
  • 7th feature ranges between 1 and 3 (in the complete dataset)
  • 8th feature is a string (so we'll ignore it for this discussion)
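These ranges are easy to check yourself. Assuming the seven numeric feature columns have already been loaded into a matrix X (one example per row, with the car-name string column dropped), a couple of lines of Octave print the per-feature minimum and maximum:

% X: m x 7 matrix of the numeric features (car-name column excluded)
feature_min = min(X);               % 1 x 7 row vector of per-column minimums
feature_max = max(X);               % 1 x 7 row vector of per-column maximums
disp([feature_min; feature_max])    % first row: min, second row: max, per feature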

The reason we are interested in the "range" of values that our features take is that the relative ranges of the features determine the shape of the cost function, and that shape in turn impacts the performance of our implementation.

I hope that you remember from this and this post that our cost function J(θ) always has a convex, bowl-like shape. When we look at a bowl from the top, we see a circle. The contours of our cost function, however, can vary from nearly circular to an extremely thin ellipse, and whether we get a circle or an ellipse is determined by the relative ranges of the features.

Also, the more circular the contours of our cost function, the better our gradient descent implementation performs, because it needs far fewer steps (iterations) to reach the global optimum. If, on the other hand, the contours are long and narrow, gradient descent tends to oscillate back and forth across the narrow valley along the major axis of the ellipse and can take a lot of iterations to reach the global optimum.
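Here is a minimal, self-contained Octave sketch of that effect (the learning rates and iteration count are illustrative choices of my own; the numbers are just the weight and mpg columns from the snippet above). The same gradient descent loop is run on one large-range feature, first unscaled and then mean-normalized: the unscaled run only tolerates a tiny learning rate and is still far from the minimum after the same number of iterations, while the scaled run converges comfortably.

% Weight (5th numeric column) and mpg (target) from the snippet above.
x_raw = [3504; 3693; 3436; 3433; 3449; 4341; 4354; 4312; 4425; 3850];
y     = [18; 15; 18; 16; 17; 15; 14; 14; 14; 15];
m     = length(y);

% Mean normalization: subtract the mean, divide by the range.
x_scl = (x_raw - mean(x_raw)) / (max(x_raw) - min(x_raw));

features = {x_raw, x_scl};
alphas   = [1e-7, 0.3];    % anything much larger than 1e-7 diverges on the raw feature
for k = 1:2
  X = [ones(m, 1), features{k}];   % prepend the intercept column x0 = 1 (never scaled)
  theta = zeros(2, 1);
  for iter = 1:1000
    theta = theta - (alphas(k) / m) * (X' * (X * theta - y));   % batch gradient descent step
  end
  J = (X * theta - y)' * (X * theta - y) / (2 * m);             % cost after 1000 iterations
  fprintf('run %d (alpha = %g): J after 1000 iterations = %.4f\n', k, alphas(k), J);
end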

In order for the contours of our cost function to be as circular as possible, we must modify our input variables so that their ranges are similar. This process is known as "feature scaling". Although there are various ways to achieve this, the most commonly used technique is "mean normalization", which is defined as

xj = (xj - mj) / sj; where

mj = average (or mean) value of xj

sj = range (max - min) OR std. deviation of xj

j = jth feature

By applying the above to all our input variables, we bring the data roughly into the form

-1 <= X(i, j) <= 1
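In Octave, a minimal sketch of mean normalization could look like the function below (the name featureNormalize and the choice of dividing by the range rather than the standard deviation are my own here, not something fixed by the definition above):

function [X_norm, mu, s] = featureNormalize(X)
  % Mean normalization: for every column (feature) of X, subtract the
  % column mean and divide by the column range, giving values in [-1, 1].
  mu = mean(X);              % 1 x n vector of feature means
  s  = max(X) - min(X);      % 1 x n vector of feature ranges (std(X) would also work)
  X_norm = (X - mu) ./ s;    % broadcasts over the rows of X
end

Returning mu and s is deliberate: any new example we later want to predict on must be scaled with exactly the same values, not with its own mean and range.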

One thing to note though is that we need not follow the above rule very religiously; approximate values are ok too. For example,

-1 <= X(:, 1) <= 1 % 1st feature

5 <= X(:, 2) <= 10 % 2nd feature

-100 <= X(:, 3) <= -50 % 3rd feature

is fine. But the following is not:

-1 <= X(:, 1) <= 1 % 1st feature

5 <= X(:, 2) <= 1000 % 2nd feature

-100 <= X(:, 3) <= -50 % 3rd feature

Of course, even with the above data the implementation will work, but it might take a very long time to converge.

Another very important thing to remember is that we should never scale the intercept feature x0, which is always equal to 1.
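In other words, we scale only the actual data columns and prepend the column of ones afterwards, untouched. Reusing the hypothetical featureNormalize sketch from above:

[X_norm, mu, s] = featureNormalize(X);            % scale the real features only
X_design = [ones(size(X_norm, 1), 1), X_norm];    % then prepend x0 = 1, left as-is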

I think that's it for today. In the upcoming posts we will study other ML algorithms and also understand which to use when. So stay tuned!