[This note shows how the softmax arises naturally in the multiclass classification problem.]
Ref: "Machine Learning Refined" by Watt, Borhani, Katsaggelos.
Consider data ${ \lbrace (\mathbf{x} _p, y _p) \rbrace _{p = 1} ^{P} }$ where each ${ \mathbf{x} _p \in \mathbb{R} ^N }$ and each ${ y _p \in \lbrace 0, 1, \ldots, C -1 \rbrace . }$
Let
$${ \overset{\circ}{\mathbf{x}} _p = (1, x _{1, p}, \ldots, x _{N, p}) ^T . }$$
Consider ${ C }$ linear classifiers with weight vectors
$${ \mathbf{w} _0, \mathbf{w} _1, \ldots, \mathbf{w} _{C - 1} \in \mathbb{R} ^{N + 1} . }$$
Let
$${ b _j = w _{0, j}, \quad \boldsymbol{\omega} _j = (w _{1, j}, \ldots, w _{N, j}) ^T , }$$
so that ${ \mathbf{w} _j ^T \overset{\circ}{\mathbf{x}} _p = b _j + \boldsymbol{\omega} _j ^T \mathbf{x} _p . }$
Consider the data point ${ (\mathbf{x} _p, y _p) . }$
Consider the signed distances of ${ \mathbf{x} _p }$ from the ${ C }$ decision boundaries
$${ \frac{\mathbf{w} _j ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _j \rVert} , \quad j = 0, \ldots, C - 1 . }$$
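(As a quick justification, not spelled out in the source: the ${ j }$-th decision boundary is the hyperplane ${ \lbrace \mathbf{x} \in \mathbb{R} ^N : b _j + \boldsymbol{\omega} _j ^T \mathbf{x} = 0 \rbrace , }$ and the usual point-to-hyperplane formula gives the signed distance of ${ \mathbf{x} _p }$ as
$${ \frac{b _j + \boldsymbol{\omega} _j ^T \mathbf{x} _p}{\lVert \boldsymbol{\omega} _j \rVert} = \frac{\mathbf{w} _j ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _j \rVert} , }$$
positive on the side of the boundary that ${ \boldsymbol{\omega} _j }$ points towards.)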
We want to pick ${ \mathbf{w} _0, \ldots, \mathbf{w} _{C - 1} }$ such that the label ${ y _p }$ can be predicted as the argmax of the above distances, that is, such that
$${ \frac{\mathbf{w} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _{y _p} \rVert} \approx \max _{j = 0, \ldots, C - 1} \frac{\mathbf{w} _j ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _j \rVert} \quad \text{ for } \, \, p = 1, \ldots, P . }$$
Hence consider the cost function (each summand below is nonnegative, and equals zero exactly when the ${ y _p }$-th distance attains the maximum)
$${ {\begin{aligned} &\, g _{\text{init}} (\mathbf{w} _0, \ldots, \mathbf{w} _{C -1}) \\ = &\, \frac{1}{P} \sum _{p = 1} ^{P} \left[ \left( \max _{j = 0, \ldots, C - 1} \frac{\mathbf{w} _j ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _j \rVert} \right) - \frac{\mathbf{w} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _{y _p} \rVert} \right] . \end{aligned}} }$$
Let
$${ \hat{\mathbf{w}} _j = \frac{\mathbf{w} _j}{\lVert \boldsymbol{\omega} _j \rVert} . }$$
Note that the cost function is
$${ {\begin{aligned} &\, g _{\text{init}} (\mathbf{w} _0, \ldots, \mathbf{w} _{C -1}) \\ = &\, \frac{1}{P} \sum _{p = 1} ^{P} \left[ \left( \max _{j = 0, \ldots, C - 1} \hat{\mathbf{w}} _j ^T \overset{\circ}{\mathbf{x}} _p \right) - \hat{\mathbf{w}} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p \right] . \end{aligned}} }$$
Note that the max function can be smoothly approximated by the LogSumExp function (see this Stack Exchange post: Link).
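More precisely (this is the standard LogSumExp bound, stated here for completeness rather than taken from the reference): for any real numbers ${ a _0, \ldots, a _{C - 1} , }$
$${ \max _{j = 0, \ldots, C - 1} a _j \, \le \, \log \left( \sum _{j = 0} ^{C - 1} \exp ( a _j ) \right) \, \le \, \max _{j = 0, \ldots, C - 1} a _j + \log C , }$$
so the approximation error is at most ${ \log C . }$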
Hence consider the cost function
$${ {\begin{aligned} &\, g (\mathbf{w} _0, \ldots, \mathbf{w} _{C -1}) \\ = &\, \frac{1}{P} \sum _{p = 1} ^{P} \left[ \log \left( \sum _{j = 0} ^{C - 1} \exp \left( \hat{\mathbf{w}} _j ^T \overset{\circ}{\mathbf{x}} _p \right) \right) - \hat{\mathbf{w}} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p \right] \\ = &\, - \frac{1}{P} \sum _{p = 1} ^{P} \log \left(\frac{\exp \left( \hat{\mathbf{w}} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p \right) }{\sum _{j = 0} ^{C - 1} \exp \left( \hat{\mathbf{w}} _j ^T \overset{\circ}{\mathbf{x}} _p \right)} \right) . \end{aligned}} }$$
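For concreteness, here is a minimal NumPy sketch of this cost (my own illustration, not code from the reference). It assumes the rows of `X` are the augmented inputs ${ \overset{\circ}{\mathbf{x}} _p }$ and the rows of `W` are already the normalized weights ${ \hat{\mathbf{w}} _j }$ (or, equivalently, that the normalization is dropped, as in standard multiclass logistic regression).

```python
import numpy as np

def softmax_cost(W, X, y):
    """Average softmax (cross-entropy) cost g.

    W : (C, N + 1) array, row j holding w_hat_j.
    X : (P, N + 1) array, row p holding the augmented input x_p.
    y : (P,) integer array of labels in {0, ..., C - 1}.
    """
    scores = X @ W.T                                         # scores[p, j] = w_hat_j^T x_p
    m = scores.max(axis=1, keepdims=True)                    # shift for numerical stability
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))   # log-sum-exp over classes
    correct = scores[np.arange(len(y)), y]                   # w_hat_{y_p}^T x_p
    return np.mean(lse - correct)
```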
Note that minimizing the cost can be viewed as maximizing the likelihood: ${ g }$ is exactly ${ - \tfrac{1}{P} }$ times the logarithm of
$${ {\begin{aligned} &\, \prod _{p = 1} ^{P} {\color{blue}{p(y = y _p \vert \overset{\circ}{\mathbf{x}} _p, \mathbf{w} _0, \ldots, \mathbf{w} _{C-1})}} \\ = &\, \prod _{p = 1} ^{P} {\color{blue}{\frac{\exp \left( \hat{\mathbf{w}} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p \right) }{\sum _{j = 0} ^{C - 1} \exp \left( \hat{\mathbf{w}} _j ^T \overset{\circ}{\mathbf{x}} _p \right)}}} . \end{aligned}} }$$
Hence we can consider the model generating the labels to be
$${ \boxed{ {\begin{aligned} &\, (y \vert \overset{\circ}{\mathbf{x}} , \mathbf{w} _0, \ldots, \mathbf{w} _{C-1}) \text{ is Categorical on } \lbrace 0, \ldots, C - 1 \rbrace , \\ &\, p(y = c \vert \overset{\circ}{\mathbf{x}} , \mathbf{w} _0, \ldots, \mathbf{w} _{C-1}) = \frac{\exp \left( \hat{\mathbf{w}} _c ^T \overset{\circ}{\mathbf{x}} \right) }{\sum _{j = 0} ^{C - 1} \exp \left( \hat{\mathbf{w}} _j ^T \overset{\circ}{\mathbf{x}} \right)} \quad \text{ for } \, c = 0, \ldots, C - 1 \end{aligned}} } }$$
and perform maximum likelihood estimation, as needed.
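As a closing sketch (again my own illustration rather than anything from the reference), maximum likelihood estimation for this model can be carried out by plain gradient descent on ${ g . }$ The code below drops the normalization of the ${ \mathbf{w} _j }$ and optimizes them directly, as in standard multiclass logistic regression; the toy data, learning rate, and step count are arbitrary.

```python
import numpy as np

def fit_softmax(X, y, C, lr=0.1, steps=2000):
    """Fit the boxed model by gradient descent on the cost g.

    X : (P, N + 1) augmented inputs, y : (P,) labels in {0, ..., C - 1}.
    Returns W of shape (C, N + 1), row j holding w_j.
    """
    P, D = X.shape
    W = np.zeros((C, D))
    Y = np.eye(C)[y]                                   # one-hot labels, shape (P, C)
    for _ in range(steps):
        scores = X @ W.T                               # (P, C)
        scores -= scores.max(axis=1, keepdims=True)    # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)      # softmax probabilities per row
        W -= lr * (probs - Y).T @ X / P                # gradient of g w.r.t. W
    return W

# toy usage: three Gaussian blobs in the plane
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 50)
X = np.hstack([np.ones((150, 1)), X_raw])              # prepend 1 to form the augmented inputs
W = fit_softmax(X, y, C=3)
print("training accuracy:", ((X @ W.T).argmax(axis=1) == y).mean())
```

The update uses the standard gradient of this cost, ${ \nabla _{\mathbf{w} _c} g = \frac{1}{P} \sum _{p = 1} ^{P} \left( p(y = c \vert \overset{\circ}{\mathbf{x}} _p) - \mathbb{1}[ y _p = c ] \right) \overset{\circ}{\mathbf{x}} _p . }$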