February 18, 2014

ml-505: Simplified Cost Function and Gradient Descent for Logistic Regression

Hello, welcome back! In the last post we defined the cost function for our logistic regression implementation. Today we will simplify it, and then move on to the next stage, viz minimization of the cost function.

As a short recap, in the last post, we defined the cost function as

cost(hθ(x), y) = -log(hθ(x)), if y = 1

= -log(1 - hθ(x)), if y = 0

The same can be rewritten simply as follows:

cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))

To see that this is equivalent to the previous definition, remember that our output variable "y" can take only one of two possible values, viz "0" and "1". So let's substitute these values into our 2nd equation and see if we can get back to the 1st.

When "y = 1"

cost(hθ(x), y) = -1 log(hθ(x)) - (1 - 1) log(1 - hθ(x))

= -log(hθ(x)) - 0 * log(1 - hθ(x))

= -log(hθ(x))

Similarly, when "y = 0"

cost(hθ(x), y) = -0 * log(hθ(x)) - (1 - 0) log(1 - hθ(x))

= -0 - 1 * log(1 - hθ(x))

= -log(1 - hθ(x))

Combining the above 2 new equations, we get

cost(hθ(x), y) = -log(hθ(x)), when "y = 1"

= -log(1 - hθ(x)), when "y = 0"

which is exactly the same as our original cost function as defined in the 1st equation.
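The algebra above can also be checked numerically with a quick Python sketch (the function name `cost` and the sample value of hθ(x) are mine, for illustration):

```python
import math

def cost(h, y):
    """Combined logistic cost for one example:
    -y * log(h) - (1 - y) * log(1 - h)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# When y = 1 the second term vanishes, leaving -log(h);
# when y = 0 the first term vanishes, leaving -log(1 - h).
h = 0.8  # some hypothesis output h_theta(x) in (0, 1)
print(cost(h, 1), -math.log(h))      # both equal -log(0.8)
print(cost(h, 0), -math.log(1 - h))  # both equal -log(0.2)
```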

Using this new definition, our real cost function J(θ) becomes:

J(θ) = -Σi=1m (y(i) log(hθ(x(i))) + (1 - y(i)) log(1 - hθ(x(i)))) / m
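As a rough sketch, this cost could be computed in vectorized form as below (NumPy is my choice here; the names `sigmoid`, `compute_cost`, `X`, `y` and `theta` are all assumptions, with `X` holding one training example per row):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function, the hypothesis for logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y):
    """J(theta) = -(1/m) * sum over i of
    y(i) * log(h(x(i))) + (1 - y(i)) * log(1 - h(x(i)))."""
    m = len(y)
    h = sigmoid(X @ theta)  # hypothesis values for all m examples at once
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```

With θ = 0 every hypothesis value is 0.5, so J(θ) works out to -log(0.5) = log 2 regardless of the data, which makes a handy sanity check.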

With that out of the way, let us move on to the next step, viz minimization. For this, we will use our old friend introduced in these earlier posts here and here, viz gradient descent. I hope you remember that the general template for gradient descent is:

repeat until convergence {

θj = θj - α * ∂/∂θj J(θ)

update all θj simultaneously

}

Without going into the mathematical derivation, the partial derivative of J(θ) w.r.t. θj is:

∂/∂θj J(θ) = Σi=1m (hθ(x(i)) - y(i)) xj(i) / m

With this information, our gradient descent step becomes

repeat until convergence {

θj = θj - α * Σi=1m (hθ(x(i)) - y(i)) xj(i) / m

update all θj simultaneously

}
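This loop can be sketched in NumPy under the same assumptions as before (all names are mine, and the learning rate α and iteration count are arbitrary illustrative choices rather than convergence-tested values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.5, num_iters=2000):
    """Repeat the update theta_j := theta_j - alpha * (1/m) * sum((h - y) * x_j),
    updating all theta_j simultaneously."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)        # predictions for all m examples
        grad = X.T @ (h - y) / m      # partial derivatives for every j at once
        theta = theta - alpha * grad  # simultaneous update of all theta_j
    return theta
```

Computing the whole gradient as `X.T @ (h - y) / m` before touching θ is what makes the update simultaneous: every θj is derived from the old θ, never from a half-updated one.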

You might notice that this gradient descent definition is exactly the same as that for linear regression as defined here. Does this mean that the two gradient descent implementations (for linear and logistic regression) are the same? Close inspection shows that this is not the case. That is because the hypothesis function hθ(x) is different for the two: linear regression uses hθ(x) = θᵀx directly, while logistic regression passes it through the sigmoid, hθ(x) = 1 / (1 + e^(-θᵀx)). Thus the two gradient descent procedures are different too, although they look the same superficially.

Once we have the optimized values of θ, it is trivial to substitute them into the definition of the hypothesis function, which can then be used to predict new values of y, given x. However, do remember that the hypothesis function for logistic regression is actually a probability (as mentioned in this post). Thus

hypothesis function = hθ(x) = p(y = 1|x; θ)

= probability that "y = 1" given "x" and parameterized by "θ"
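So a minimal prediction step just thresholds this probability at 0.5 (again a Python sketch; the name `predict` and the `threshold` parameter are my additions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    """h_theta(x) = P(y = 1 | x; theta); predict y = 1 when that
    probability is at least the threshold, else y = 0."""
    return (sigmoid(X @ theta) >= threshold).astype(int)
```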

Lastly, we had described a method for verifying that the gradient descent implementation is indeed working correctly and is minimizing the cost function in this post. Although that method was used for linear regression, the exact same method can be used for logistic regression too.

This almost completes the logistic regression part of ML algorithms. The only thing left is to look at cases when y can take a value from a predefined set of values, instead of just 2 values "0" and "1"; that is multi-class classification instead of binary classification. We will look at this last detail in the next post. So stay tuned!