Python – Logistic Regression function on sklearn

machine learningnumpypythonscikit-learnscipy

I am learning Logistic Regression from sklearn and came across this : http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

I have a created an implementation which shows me the accuracy scores for training and testing. However it is very unclear how this was achieved. My question is : What is the Maximum likelihood estimate? How is this being calculated? What is the error measure? What is the optimisation algorithm used?

I know all of the above in theory, however I am not sure where and when and how scikit.learn calculates it, or if its something I need to implement at some point. I have an accuracy rate of 83% which was what I was aiming for but I am very confused about how this was achieved by scikit learn.

Would anyone be able to point me in the right direction?

Best Answer

I recently started studying LR myself, I still don't get many steps of the derivation but I think I understand which formulas are being used.

First of all let's assume that you are using the latest version of scikit-learn and that the solver being used is solver='lbfgs' (which is the default I believe).

The code is here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py

What is the Maximum likelihood estimate? How is this being calculated?

The function to compute the likelihood estimate is this one https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L57

The interesting line is:

# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)

which is the formula 7 of this tutorial. The function also computes the gradient of the likelihood, which is then passed to the minimization function (see below). One important thing is that the intercept is w0 of the formulas in the tutorial. But that's only valid fit_intercept is True.

What is the error measure?

I'm sorry I'm not sure.

What is the optimisation algorithm used?

See the following lines in the code: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L389

It's this function http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_l_bfgs_b.html

One very important thing is that the classes are +1 or -1! (for the binary case, in the literature 0 and 1 are common, but it won't work)

Also notice that numpy broadcasting rules are used in all formulas. (That's why you don't see iteration)

This was my attempt at understanding the code. I slowly went mad till the point of ripping appart scikit-learn code (in only works for the binary case). Also this served as inspiration too

Hope it helps.

Related Topic