Gradient of Softmax + Cross-Entropy w.r.t Logits




Author: Bindeshwar Singh Kushwaha – PostNetwork Academy

Goal

We want to compute:

$$ \frac{\partial L}{\partial z_j} $$

Notation:

  • Logits: \(z = [z_1, z_2, \dots, z_C]\)
  • Softmax: \(\hat{y}_i = \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}}\)
  • Cross-Entropy Loss: \(L = -\sum_{i=1}^{C} y_i \log \hat{y}_i\), where \(y_i\) is one-hot.
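As a quick sanity check, the two building blocks can be sketched in NumPy (the example logits and target below are illustrative, not from the derivation):

```python
import numpy as np

def softmax(z):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_hat, y):
    # y is one-hot, so only the true class contributes to the sum
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, 1.0, 0.1])   # logits
y = np.array([1.0, 0.0, 0.0])   # one-hot target
y_hat = softmax(z)              # probabilities summing to 1
loss = cross_entropy(y_hat, y)
```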


Loss derivative w.r.t Softmax output

From the loss definition:

$$ \frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i} $$

Explanation: This follows from the derivative of \(\log x\), which is \(1/x\), together with the negative sign in the loss.
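A finite-difference check of this derivative, using illustrative values for \(\hat{y}\) and \(y\):

```python
import numpy as np

# Illustrative probabilities and one-hot target (arbitrary values)
y_hat = np.array([0.6, 0.3, 0.1])
y = np.array([1.0, 0.0, 0.0])

def loss(p):
    # Cross-entropy as a function of the probability vector
    return -np.sum(y * np.log(p))

# Central difference w.r.t. the true-class probability
eps = 1e-6
p_plus, p_minus = y_hat.copy(), y_hat.copy()
p_plus[0] += eps
p_minus[0] -= eps
numeric = (loss(p_plus) - loss(p_minus)) / (2 * eps)
analytic = -y[0] / y_hat[0]   # dL/d y_hat_0 = -y_0 / y_hat_0
```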


Softmax derivative as a Jacobian

Because each output \(\hat{y}_i\) depends on every logit \(z_j\), the derivative of softmax is not a vector but a \(C \times C\) Jacobian matrix:

$$
\frac{\partial \hat{y}}{\partial z} =
\begin{bmatrix}
\frac{\partial \hat{y}_1}{\partial z_1} & \dots & \frac{\partial \hat{y}_1}{\partial z_C} \\
\vdots & \ddots & \vdots \\
\frac{\partial \hat{y}_C}{\partial z_1} & \dots & \frac{\partial \hat{y}_C}{\partial z_C}
\end{bmatrix}
$$

  • Diagonal elements (\(i=j\)) show self-influence.
  • Off-diagonal (\(i\neq j\)) show competition across classes.
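In code, the entire Jacobian can be built at once as \(\operatorname{diag}(\hat{y}) - \hat{y}\hat{y}^\top\) (a minimal sketch; the logits are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(y_hat):
    # J[i, j] = y_hat[i] * (1 if i == j else 0) - y_hat[i] * y_hat[j]
    return np.diag(y_hat) - np.outer(y_hat, y_hat)

y_hat = softmax(np.array([2.0, 1.0, 0.1]))
J = softmax_jacobian(y_hat)
# Each column sums to zero: the probabilities always sum to 1,
# so pushing one logit only redistributes mass across classes.
```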

Case 1: \(i=j\)

Applying the quotient rule to \(\hat{y}_i = e^{z_i}/\sum_k e^{z_k}\):

$$
\frac{\partial \hat{y}_i}{\partial z_i}
= \frac{e^{z_i}\sum_k e^{z_k} - e^{z_i} e^{z_i}}{\left(\sum_k e^{z_k}\right)^2}
= \hat{y}_i (1 - \hat{y}_i)
$$


Case 2: \(i \neq j\)

Here the numerator \(e^{z_i}\) does not depend on \(z_j\), so only the denominator varies:

$$
\frac{\partial \hat{y}_i}{\partial z_j}
= -\frac{e^{z_i} e^{z_j}}{\left(\sum_k e^{z_k}\right)^2}
= -\hat{y}_i \hat{y}_j
$$

Interpretation: Increasing \(z_j\) decreases \(\hat{y}_i\) for \(i \neq j\).

Compact Softmax Derivative

$$
\frac{\partial \hat{y}_i}{\partial z_j} =
\begin{cases}
\hat{y}_i(1-\hat{y}_i), & i=j \\
-\hat{y}_i \hat{y}_j, & i \neq j
\end{cases}
$$
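Both cases can be checked numerically against central differences (the step size and test logits are arbitrary choices):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
y_hat = softmax(z)
C = len(z)

# Numerical Jacobian via central differences, one column per logit
eps = 1e-6
J_num = np.zeros((C, C))
for j in range(C):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[j] += eps
    z_minus[j] -= eps
    J_num[:, j] = (softmax(z_plus) - softmax(z_minus)) / (2 * eps)

# Analytic Jacobian from the compact formula above
J_ana = np.diag(y_hat) - np.outer(y_hat, y_hat)
```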

Chain Rule

$$
\frac{\partial L}{\partial z_j} = \sum_{i=1}^{C} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial z_j}
$$

Gradient Simplification

Substituting \(\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}\) and the two Jacobian cases into the chain rule:

$$
\frac{\partial L}{\partial z_j}
= -\frac{y_j}{\hat{y}_j}\,\hat{y}_j(1-\hat{y}_j) + \sum_{i \neq j} \left(-\frac{y_i}{\hat{y}_i}\right)\left(-\hat{y}_i \hat{y}_j\right)
= -y_j(1 - \hat{y}_j) + \sum_{i \neq j} y_i \hat{y}_j
$$

Since \(y\) is one-hot, \(\sum_{i \neq j} y_i = 1 - y_j\), so the expression collapses:

$$
\frac{\partial L}{\partial z_j}
= -y_j(1 - \hat{y}_j) + \hat{y}_j(1 - y_j)
= \hat{y}_j - y_j
$$

Interpretation: the gradient is the predicted probability minus the true label.
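The simplification can be verified by comparing the full chain-rule product with \(\hat{y} - y\) (the logits and target are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])   # true class is index 1
y_hat = softmax(z)

dL_dyhat = -y / y_hat                         # dL/d y_hat_i = -y_i / y_hat_i
J = np.diag(y_hat) - np.outer(y_hat, y_hat)   # softmax Jacobian
grad_full = J.T @ dL_dyhat                    # chain rule, summing over i
grad_simple = y_hat - y                       # the simplified result
```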

Final Result

$$
\boxed{\frac{\partial L}{\partial z} = \hat{y} - y}
$$

  • The gradient is simply predicted minus target.
  • No need to compute full Jacobian in practice.
  • Efficient for classification tasks.


About PostNetwork Academy

  • Website: www.postnetwork.co
  • YouTube: www.youtube.com/@postnetworkacademy
  • Facebook: www.facebook.com/postnetworkacademy
  • LinkedIn: www.linkedin.com/company/postnetworkacademy
  • GitHub: www.github.com/postnetworkacademy

Thank You!

© PostNetwork. All rights reserved.