February 1, 2014

ml-401: Multivariate Linear Regression

Hi! In the last post, we finished learning our very first machine learning algorithm, viz. "linear regression". As you might remember from here, this is a type of "supervised learning" ML algorithm. However, in that study we had only 1 feature (or input variable), so what we learnt was really "univariate linear regression". Today we will look at how to handle a more realistic situation, namely when we have more than one feature. Such an algorithm is called "multivariate linear regression".

To recap, for univariate linear regression, the steps were:

  • x => input variable/feature (in "univariate", we have only 1 feature)
  • y => output variable/target (this is the value that we need to predict)
  • m => #training examples
  • h(x) => hypothesis function, defined as

h_Θ(x) = Θ_0 + Θ_1 x

  • cost function => J(Θ_0, Θ_1) = (Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i))^2) / (2m)
  • minimization of the cost function was done using the "gradient descent" algorithm which was implemented as

repeat until convergence {

Θ_0 = Θ_0 - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)), for j = 0

Θ_1 = Θ_1 - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)) * x^(i), for j = 1

}
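
For concreteness, here is a minimal, loop-based Python sketch of this univariate version (the function name gradient_descent_uni, the default learning rate and the fixed iteration count are my own illustrative choices; the pseudocode above instead loops "until convergence"):

def gradient_descent_uni(x, y, alpha=0.01, num_iters=1500):
    # x, y: plain Python lists of length m (one feature, m training examples)
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        # compute both gradient terms with the CURRENT values of theta0, theta1 ...
        grad0 = sum((theta0 + theta1 * x[i]) - y[i] for i in range(m)) / m
        grad1 = sum(((theta0 + theta1 * x[i]) - y[i]) * x[i] for i in range(m)) / m
        # ... and only then update both parameters (the simultaneous update)
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1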

For the "multivariate" version of "linear regression" ML algorithm, we will have multiple features x1, x2, x3, etc. We define one more notation

  • n => number of features

In this case our hypothesis function "h(x)" will change to:

h_Θ(x) = Θ_0 + Θ_1 x_1 + Θ_2 x_2 + Θ_3 x_3 + … + Θ_n x_n

or

h_Θ(x) = Θ_0 + Σ_{j=1}^{n} Θ_j x_j

It is customary (it makes the later maths easier) to add an extra feature x_0 whose value is always "1", so that h(x) becomes

h_Θ(x) = Σ_{j=0}^{n} Θ_j x_j
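
As a quick sketch (the helper name hypothesis is mine, purely for illustration), once x_0 = 1 is prepended to the feature list, h_Θ(x) is simply a dot product of the parameters and the features:

def hypothesis(theta, x):
    # theta = [Θ_0, Θ_1, ..., Θ_n]
    # x     = [1, x_1, ..., x_n]   (note the extra x_0 = 1 in front)
    return sum(t_j * x_j for t_j, x_j in zip(theta, x))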

Our cost function too will change as:

J(Θ_0, Θ_1, Θ_2, …, Θ_n) = (Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i))^2) / (2m)

Here x^(i) is the whole feature vector (x^(i)_0, x^(i)_1, …, x^(i)_n) of the i-th training example, so h_Θ(x^(i)) uses all the features of that example at once.
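
A small Python sketch of this cost, reusing the hypothesis helper from above (compute_cost is again an illustrative name of my own, not anything official):

def compute_cost(theta, X, y):
    # X: list of m training examples, each a list [1, x_1, ..., x_n]
    # y: list of the m target values
    m = len(y)
    total = sum((hypothesis(theta, X[i]) - y[i]) ** 2 for i in range(m))
    return total / (2 * m)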

And so will the gradient descent:

repeat until convergence {

Θ_0 = Θ_0 - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)) * x^(i)_0, for j = 0 (recall x^(i)_0 = 1)

Θ_1 = Θ_1 - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)) * x^(i)_1, for j = 1

Θ_2 = Θ_2 - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)) * x^(i)_2, for j = 2

…

Θ_n = Θ_n - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)) * x^(i)_n, for j = n

}

or, re-written simply:

repeat until convergence {

Θ_j = Θ_j - α * (1/m) * Σ_{i=1}^{m} (h_Θ(x^(i)) - y^(i)) * x^(i)_j, for j = 0 to n

}

As always, remember that the above updates to all Θ_j (for j = 0 to n) should happen simultaneously for a correct implementation (see here).
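
To make that "simultaneous update" point concrete, here is a plain, loop-based Python sketch of multivariate gradient descent (it reuses the hypothesis helper above; the name gradient_descent, the default α and the fixed iteration count are my own choices, and we will see a faster, vectorized version in the next post):

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    # X: list of m examples, each [1, x_1, ..., x_n]; y: list of m targets
    m, n_plus_1 = len(y), len(X[0])
    theta = [0.0] * n_plus_1
    for _ in range(num_iters):
        # first compute every partial derivative using the CURRENT theta ...
        grads = [
            sum((hypothesis(theta, X[i]) - y[i]) * X[i][j] for i in range(m)) / m
            for j in range(n_plus_1)
        ]
        # ... and only then update all Θ_j together (the simultaneous update)
        theta = [theta[j] - alpha * grads[j] for j in range(n_plus_1)]
    return theta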

Once we have learnt the parameters Θ_j using the above implementation, it is trivial to predict "y" for any given set of input features:

y = h_Θ(x) = Σ_{j=0}^{n} Θ_j x_j
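
Putting the sketches above together on a tiny made-up dataset (toy numbers only, purely for illustration):

# toy data: m = 4 examples, n = 2 features, with x_0 = 1 already prepended
X = [[1, 2, 3], [1, 4, 1], [1, 5, 6], [1, 7, 2]]
y = [10.0, 12.0, 25.0, 22.0]

theta = gradient_descent(X, y, alpha=0.01, num_iters=2000)
print(compute_cost(theta, X, y))     # the cost should have decreased during training
print(hypothesis(theta, [1, 3, 4]))  # predicted y for a new example x = (3, 4)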

That's it for today. In the next post we will learn how to implement these algorithms using "vectorization" for faster performance. So stay tuned!