[This note shows how the softmax arises naturally in the multiclass classification problem.]
Ref: "Machine Learning Refined" by Watt, Borhani, Katsaggelos.
Consider data ${ \lbrace (\mathbf{x} _p, y _p) \rbrace _{p = 1} ^{P} }$ where each ${ \mathbf{x} _p \in \mathbb{R} ^N }$ and each ${ y _p \in \lbrace 0, 1, \ldots, C -1 \rbrace . }$
Let
$${ \overset{\circ}{\mathbf{x}} _p = (1, x _{1, p}, \ldots, x _{N, p}) ^T . }$$
Consider ${ C }$ linear classifiers with weight vectors
$${ \mathbf{w} _0, \mathbf{w} _1, \ldots, \mathbf{w} _{C - 1} \in \mathbb{R} ^{N + 1} . }$$
Let
$${ b _j = w _{0, j}, \quad \boldsymbol{\omega} _j = (w _{1, j}, \ldots, w _{N, j}) ^T , }$$
so that ${ \mathbf{w} _j ^T \overset{\circ}{\mathbf{x}} _p = b _j + \boldsymbol{\omega} _j ^T \mathbf{x} _p . }$
Consider the data point ${ (\mathbf{x} _p, y _p) . }$
Consider the signed distances of ${ \mathbf{x} _p }$ from the ${ C }$ decision boundaries
$${ \frac{\mathbf{w} _j ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _j \rVert} , \quad j = 0, \ldots, C - 1 . }$$
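(As a quick justification, not spelled out in the source: the ${ j }$-th decision boundary is the hyperplane ${ \lbrace \mathbf{x} \in \mathbb{R} ^N : b _j + \boldsymbol{\omega} _j ^T \mathbf{x} = 0 \rbrace , }$ and the usual point-to-hyperplane formula gives the signed distance of ${ \mathbf{x} _p }$ as
$${ \frac{b _j + \boldsymbol{\omega} _j ^T \mathbf{x} _p}{\lVert \boldsymbol{\omega} _j \rVert} = \frac{\mathbf{w} _j ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _j \rVert} , }$$
positive on the side of the boundary that ${ \boldsymbol{\omega} _j }$ points towards.)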
We want to pick ${ \mathbf{w} _0, \ldots, \mathbf{w} _{C - 1} }$ such that the label ${ y _p }$ can be predicted as the argmax of the above distances, that is, such that
$${ \frac{\mathbf{w} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _{y _p} \rVert} \approx \max _{j = 0, \ldots, C - 1} \frac{\mathbf{w} _j ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _j \rVert} \quad \text{ for } \, \, p = 1, \ldots, P . }$$
Hence consider the cost function (each summand below is nonnegative, and equals zero exactly when the ${ y _p }$-th distance attains the maximum)
$${ {\begin{aligned} &\, g _{\text{init}} (\mathbf{w} _0, \ldots, \mathbf{w} _{C -1}) \\ = &\, \frac{1}{P} \sum _{p = 1} ^{P} \left[ \left( \max _{j = 0, \ldots, C - 1} \frac{\mathbf{w} _j ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _j \rVert} \right) - \frac{\mathbf{w} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p}{\lVert \boldsymbol{\omega} _{y _p} \rVert} \right] . \end{aligned}} }$$
Let
$${ \hat{\mathbf{w}} _j = \frac{\mathbf{w} _j}{\lVert \boldsymbol{\omega} _j \rVert} . }$$
Note that the cost function is
$${ {\begin{aligned} &\, g _{\text{init}} (\mathbf{w} _0, \ldots, \mathbf{w} _{C -1}) \\ = &\, \frac{1}{P} \sum _{p = 1} ^{P} \left[ \left( \max _{j = 0, \ldots, C - 1} \hat{\mathbf{w}} _j ^T \overset{\circ}{\mathbf{x}} _p \right) - \hat{\mathbf{w}} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p \right] . \end{aligned}} }$$
Note that the max function can be smoothly approximated by the LogSumExp function (see this Stack Exchange post: Link).
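More precisely (this is the standard LogSumExp bound, stated here for completeness rather than taken from the reference): for any real numbers ${ a _0, \ldots, a _{C - 1} , }$
$${ \max _{j = 0, \ldots, C - 1} a _j \, \le \, \log \left( \sum _{j = 0} ^{C - 1} \exp ( a _j ) \right) \, \le \, \max _{j = 0, \ldots, C - 1} a _j + \log C , }$$
so the approximation error is at most ${ \log C . }$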
Hence consider the cost function
$${ {\begin{aligned} &\, g (\mathbf{w} _0, \ldots, \mathbf{w} _{C -1}) \\ = &\, \frac{1}{P} \sum _{p = 1} ^{P} \left[ \log \left( \sum _{j = 0} ^{C - 1} \exp \left( \hat{\mathbf{w}} _j ^T \overset{\circ}{\mathbf{x}} _p \right) \right) - \hat{\mathbf{w}} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p \right] \\ = &\, - \frac{1}{P} \sum _{p = 1} ^{P} \log \left(\frac{\exp \left( \hat{\mathbf{w}} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p \right) }{\sum _{j = 0} ^{C - 1} \exp \left( \hat{\mathbf{w}} _j ^T \overset{\circ}{\mathbf{x}} _p \right)} \right) . \end{aligned}} }$$
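For concreteness, here is a minimal NumPy sketch of this cost (my own illustration, not code from the reference). It assumes the rows of `X` are the augmented inputs ${ \overset{\circ}{\mathbf{x}} _p }$ and the rows of `W` are already the normalized weights ${ \hat{\mathbf{w}} _j }$ (or, equivalently, that the normalization is dropped, as in standard multiclass logistic regression).

```python
import numpy as np

def softmax_cost(W, X, y):
    """Average softmax (cross-entropy) cost g.

    W : (C, N + 1) array, row j holding w_hat_j.
    X : (P, N + 1) array, row p holding the augmented input x_p.
    y : (P,) integer array of labels in {0, ..., C - 1}.
    """
    scores = X @ W.T                                         # scores[p, j] = w_hat_j^T x_p
    m = scores.max(axis=1, keepdims=True)                    # shift for numerical stability
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))   # log-sum-exp over classes
    correct = scores[np.arange(len(y)), y]                   # w_hat_{y_p}^T x_p
    return np.mean(lse - correct)
```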
Note that minimizing the cost can be viewed as maximizing the likelihood: ${ g }$ is exactly ${ - \tfrac{1}{P} }$ times the logarithm of
$${ {\begin{aligned} &\, \prod _{p = 1} ^{P} {\color{blue}{p(y = y _p \vert \overset{\circ}{\mathbf{x}} _p, \mathbf{w} _0, \ldots, \mathbf{w} _{C-1})}} \\ = &\, \prod _{p = 1} ^{P} {\color{blue}{\frac{\exp \left( \hat{\mathbf{w}} _{y _p} ^T \overset{\circ}{\mathbf{x}} _p \right) }{\sum _{j = 0} ^{C - 1} \exp \left( \hat{\mathbf{w}} _j ^T \overset{\circ}{\mathbf{x}} _p \right)}}} . \end{aligned}} }$$
Hence we can consider the model generating the labels to be
$${ \boxed{ {\begin{aligned} &\, (y \vert \overset{\circ}{\mathbf{x}} , \mathbf{w} _0, \ldots, \mathbf{w} _{C-1}) \text{ is Categorical on } \lbrace 0, \ldots, C - 1 \rbrace , \\ &\, p(y = c \vert \overset{\circ}{\mathbf{x}} , \mathbf{w} _0, \ldots, \mathbf{w} _{C-1}) = \frac{\exp \left( \hat{\mathbf{w}} _c ^T \overset{\circ}{\mathbf{x}} \right) }{\sum _{j = 0} ^{C - 1} \exp \left( \hat{\mathbf{w}} _j ^T \overset{\circ}{\mathbf{x}} \right)} \quad \text{ for } \, c = 0, \ldots, C - 1 \end{aligned}} } }$$
and perform maximum likelihood estimation, as needed.
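As a closing sketch (again my own illustration rather than anything from the reference), maximum likelihood estimation for this model can be carried out by plain gradient descent on ${ g . }$ The code below drops the normalization of the ${ \mathbf{w} _j }$ and optimizes them directly, as in standard multiclass logistic regression; the toy data, learning rate, and step count are arbitrary.

```python
import numpy as np

def fit_softmax(X, y, C, lr=0.1, steps=2000):
    """Fit the boxed model by gradient descent on the cost g.

    X : (P, N + 1) augmented inputs, y : (P,) labels in {0, ..., C - 1}.
    Returns W of shape (C, N + 1), row j holding w_j.
    """
    P, D = X.shape
    W = np.zeros((C, D))
    Y = np.eye(C)[y]                                   # one-hot labels, shape (P, C)
    for _ in range(steps):
        scores = X @ W.T                               # (P, C)
        scores -= scores.max(axis=1, keepdims=True)    # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)      # softmax probabilities per row
        W -= lr * (probs - Y).T @ X / P                # gradient of g w.r.t. W
    return W

# toy usage: three Gaussian blobs in the plane
rng = np.random.default_rng(0)
X_raw = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 50)
X = np.hstack([np.ones((150, 1)), X_raw])              # prepend 1 to form the augmented inputs
W = fit_softmax(X, y, C=3)
print("training accuracy:", ((X @ W.T).argmax(axis=1) == y).mean())
```

The update uses the standard gradient of this cost, ${ \nabla _{\mathbf{w} _c} g = \frac{1}{P} \sum _{p = 1} ^{P} \left( p(y = c \vert \overset{\circ}{\mathbf{x}} _p) - \mathbb{1}[ y _p = c ] \right) \overset{\circ}{\mathbf{x}} _p . }$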