Gradient of Softmax + Cross-Entropy w.r.t. Logits
Author: Bindeshwar Singh Kushwaha – PostNetwork Academy
Goal
We want to compute:
$$ \frac{\partial L}{\partial z_j} $$
Notation:
- Logits: \(z = [z_1, z_2, \dots, z_C]\)
- Softmax: \(\hat{y}_i = \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}}\)
- Cross-Entropy Loss: \(L = -\sum_{i=1}^{C} y_i \log \hat{y}_i\), where \(y_i\) is one-hot.
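To make the notation concrete, here is a minimal NumPy sketch of the forward pass; the logits and one-hot target below are illustrative values, not from the text:

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, y_hat):
    # y is one-hot, so only the true-class term contributes to the sum.
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, 1.0, 0.1])   # example logits, C = 3
y = np.array([1.0, 0.0, 0.0])   # one-hot target
y_hat = softmax(z)
loss = cross_entropy(y, y_hat)
```

Because \(y\) is one-hot, the loss reduces to \(-\log \hat{y}_c\) for the true class \(c\).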
Loss derivative w.r.t. Softmax output
From the loss definition:
$$ \frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i} $$
Explanation: this follows from \(\frac{d}{dx}\log x = \frac{1}{x}\) together with the leading negative sign in the loss.
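This derivative can be checked numerically with central differences; the probabilities and target below are illustrative:

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])        # one-hot target
y_hat = np.array([0.2, 0.5, 0.3])    # some softmax output

def loss(y_hat):
    return -np.sum(y * np.log(y_hat))

# Analytic derivative: dL/d y_hat_i = -y_i / y_hat_i
analytic = -y / y_hat

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(y_hat)
for i in range(len(y_hat)):
    d = np.zeros_like(y_hat)
    d[i] = eps
    numeric[i] = (loss(y_hat + d) - loss(y_hat - d)) / (2 * eps)
```

Only the true-class coordinate has a nonzero derivative, exactly as the formula predicts.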
Softmax derivative as a Jacobian
Because each output \(\hat{y}_i\) depends on every logit \(z_j\), the derivative of softmax is not a vector but a \(C \times C\) Jacobian matrix:
$$
\frac{\partial \hat{y}}{\partial z} =
\begin{bmatrix}
\frac{\partial \hat{y}_1}{\partial z_1} & \dots & \frac{\partial \hat{y}_1}{\partial z_C} \\
\vdots & \ddots & \vdots \\
\frac{\partial \hat{y}_C}{\partial z_1} & \dots & \frac{\partial \hat{y}_C}{\partial z_C}
\end{bmatrix}
$$
- Diagonal elements (\(i=j\)) show self-influence.
- Off-diagonal (\(i\neq j\)) show competition across classes.
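The full Jacobian can be assembled in one expression as \(\operatorname{diag}(\hat{y}) - \hat{y}\hat{y}^\top\), which packages both the diagonal and off-diagonal cases; a NumPy sketch with illustrative logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)

# Full C x C Jacobian: J[i, j] = p_i * (delta_ij - p_j)
J = np.diag(p) - np.outer(p, p)
```

Each column sums to zero, reflecting that the softmax outputs always sum to 1 regardless of the logits.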
Case 1: \(i=j\)
$$
\frac{\partial \hat{y}_i}{\partial z_i}
= \frac{e^{z_i}\sum_{k} e^{z_k} - e^{z_i}\, e^{z_i}}{\left(\sum_{k} e^{z_k}\right)^2}
= \hat{y}_i (1 - \hat{y}_i)
$$
Case 2: \(i \neq j\)
Here the numerator \(e^{z_i}\) does not depend on \(z_j\), so the quotient rule gives:
$$
\frac{\partial \hat{y}_i}{\partial z_j}
= \frac{0 \cdot \sum_{k} e^{z_k} - e^{z_i}\, e^{z_j}}{\left(\sum_{k} e^{z_k}\right)^2}
= -\hat{y}_i \hat{y}_j
$$
Interpretation: Increasing \(z_j\) decreases \(\hat{y}_i\) for \(i \neq j\).
Compact Softmax Derivative
$$
\frac{\partial \hat{y}_i}{\partial z_j} =
\begin{cases}
\hat{y}_i(1-\hat{y}_i), & i=j \\
-\hat{y}_i \hat{y}_j, & i \neq j
\end{cases}
$$
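The compact piecewise formula can be verified against a finite-difference approximation of softmax itself; the logits below are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])
p = softmax(z)
C = len(z)

# Jacobian from the piecewise formula
J = np.empty((C, C))
for i in range(C):
    for j in range(C):
        J[i, j] = p[i] * (1 - p[i]) if i == j else -p[i] * p[j]

# Central finite differences: perturb one logit, watch every output
eps = 1e-6
J_num = np.empty((C, C))
for j in range(C):
    d = np.zeros(C)
    d[j] = eps
    J_num[:, j] = (softmax(z + d) - softmax(z - d)) / (2 * eps)
```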
Chain Rule
$$
\frac{\partial L}{\partial z_j} = \sum_{i=1}^{C} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial z_j}
$$
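The chain-rule sum over \(i\) is a matrix-vector product of the Jacobian with the loss derivative from earlier; a sketch with illustrative values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0])
y = np.array([0.0, 1.0, 0.0])
p = softmax(z)

dL_dyhat = -y / p                    # dL/d y_hat_i from the loss derivative
J = np.diag(p) - np.outer(p, p)      # J[i, j] = d y_hat_i / d z_j
grad = J.T @ dL_dyhat                # chain rule: sum over i for each j
```

Carrying out this product by hand is exactly the simplification performed in the next section.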
Gradient Simplification
Substituting \(\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}\) and the two Jacobian cases:
$$
\frac{\partial L}{\partial z_j}
= -\frac{y_j}{\hat{y}_j}\,\hat{y}_j(1-\hat{y}_j)
+ \sum_{i \neq j} \left(-\frac{y_i}{\hat{y}_i}\right)\left(-\hat{y}_i \hat{y}_j\right)
= -y_j(1 - \hat{y}_j) + \sum_{i \neq j} y_i \hat{y}_j
$$
Since \(y\) is one-hot, \(\sum_{i \neq j} y_i = 1 - y_j\), hence
$$
\frac{\partial L}{\partial z_j}
= -y_j + y_j \hat{y}_j + \hat{y}_j (1 - y_j)
= \hat{y}_j - y_j
$$
Interpretation: the gradient equals the predicted probability minus the true label.
Final Result
$$
\boxed{\frac{\partial L}{\partial z} = \hat{y} - y}
$$
- The gradient is simply the predicted distribution minus the target.
- There is no need to form the full Jacobian in practice.
- This makes the backward pass cheap for classification tasks.
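As a final sanity check, the closed form \(\hat{y} - y\) agrees with a central-difference estimate of \(\partial L / \partial z\); the values below are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    # Cross-entropy of softmax(z) against one-hot y
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.5, -0.3, 0.8, 0.1])
y = np.array([0.0, 0.0, 1.0, 0.0])

grad_analytic = softmax(z) - y       # the boxed result

# Central-difference estimate of dL/dz, one logit at a time
eps = 1e-6
grad_numeric = np.array([
    (loss(z + eps * e, y) - loss(z - eps * e, y)) / (2 * eps)
    for e in np.eye(len(z))
])
```

This is why deep-learning frameworks backpropagate through softmax + cross-entropy as a single fused operation.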
About PostNetwork Academy
- Website: www.postnetwork.co
- YouTube: www.youtube.com/@postnetworkacademy
- Facebook: www.facebook.com/postnetworkacademy
- LinkedIn: www.linkedin.com/company/postnetworkacademy
- GitHub: www.github.com/postnetworkacademy
