
I am trying to understand the math behind logistic regression. Going through a couple of websites, lectures, and books, I tried to derive the cost function by treating it as the negative log-likelihood. My derivation matches the cost function shown on the Wikipedia page https://en.wikipedia.org/wiki/Logistic_regression and in other places.

If the inputs are $x^{(i)}$ and outputs are $y^{(i)}$, where $(i)$ refers to the $i$th data point, then the cost as a function of weights $w$ seems to be

$$-\sum_{i=1}^m \left[ y^{(i)} \log\left(\frac{1}{1 + e^{-w^Tx^{(i)}}}\right)+\left(1-y^{(i)} \right) \log \left( 1-\frac{1}{1 + e^{-w^Tx^{(i)}}} \right) \right]$$

I can simplify this further to $$\sum_{i=1}^m \left[ -w^Tx^{(i)} y^{(i)} +\log (1+e^{w^Tx^{(i)}}) \right].$$ However, the expression shown in the scikit-learn user guide https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression is $$\sum_{i=1}^m \log ( 1 + e^{-w^Tx^{(i)}y^{(i)}}).$$

I have tried some algebra but cannot derive their formulation from mine. Am I missing something? It is quite possible that I haven't tried all the tricks there are for simplifying the expression.
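For what it's worth, here is a quick numerical sanity check of the simplification above (a minimal sketch using NumPy; the values of $w^Tx^{(i)}$ and the labels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10)          # arbitrary values of w^T x^(i)
y = rng.integers(0, 2, size=10)  # labels in {0, 1}

sigma = 1.0 / (1.0 + np.exp(-z))

# Binary cross-entropy form of the cost
bce = -np.sum(y * np.log(sigma) + (1 - y) * np.log(1 - sigma))

# Simplified form: -y * w^T x + log(1 + exp(w^T x))
simplified = np.sum(-y * z + np.log(1 + np.exp(z)))

print(np.isclose(bce, simplified))  # True
```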

  • Important note: scikit-learn's implementation of logistic regression uses regularization by default. – Commented Feb 21, 2021 at 7:51
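A minimal sketch of what this means in practice (toy, made-up data; `C` is the inverse regularization strength, so a very large `C` makes the default $L_2$ penalty negligible and approximates a plain MLE fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Default fit applies L2 regularization with C = 1.0
default_fit = LogisticRegression().fit(X, y)

# A very large C makes the penalty negligible, approximating the unregularized MLE
near_mle_fit = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)

print(default_fit.coef_)
print(near_mle_fit.coef_)
```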

1 Answer


From the same scikit-learn link, note that in their notation the target $y^{(i)}$ is assumed to take values in the set $\{-1,+1\}$ at trial $i$.

In contrast, $y^{(i)} \in \{0,1\}$ in your initial definition of the binary cross-entropy (BCE) cost function.

In the scikit-learn notation, we have

$$P(y^{(i)}=+1\mid x^{(i)})=\sigma(w^Tx^{(i)})=\frac{1}{1+e^{-w^Tx^{(i)}}}$$

$$P(y^{(i)}=-1\mid x^{(i)})=1-\sigma(w^Tx^{(i)})=\frac{1}{1+e^{w^Tx^{(i)}}},$$

so that in both cases we have

$$P(y^{(i)}\mid x^{(i)})=\frac{1}{1+e^{-w^Tx^{(i)}y^{(i)}}}$$

Under the independence assumption, the likelihood is $\prod\limits_{i=1}^m \frac{1}{1+e^{-w^Tx^{(i)}y^{(i)}}}$.

The negative log-likelihood is $\sum\limits_{i=1}^m \log(1+e^{-w^T x^{(i)} y^{(i)}})$, which is the cost function scikit-learn minimizes (the MLE objective), together with an added regularization term ($L_2$ by default).
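A quick numerical illustration of the equivalence (a sketch using NumPy; `y01` are the $\{0,1\}$ labels and `ypm` the same labels mapped to $\{-1,+1\}$):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10)            # w^T x^(i)
y01 = rng.integers(0, 2, size=10)  # labels in {0, 1}
ypm = 2 * y01 - 1                  # same labels mapped to {-1, +1}

sigma = 1.0 / (1.0 + np.exp(-z))

# Binary cross-entropy with {0, 1} labels
bce = -np.sum(y01 * np.log(sigma) + (1 - y01) * np.log(1 - sigma))

# scikit-learn style log-loss with {-1, +1} labels
logloss = np.sum(np.log(1 + np.exp(-ypm * z)))

print(np.isclose(bce, logloss))  # True
```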

  • Nice! Thanks. I don't think my original derivation (not really mine, courtesy of Wikipedia) makes any assumptions about the actual values of $y$. This is good to know about scikit-learn. – Commented Feb 20, 2021 at 22:31
  • Never mind, I see where the assumption of $y \in \{0,1\}$ is made in the derivation. – Commented Feb 20, 2021 at 22:36
