Gradient of Softmax + Cross-Entropy w.r.t. Logits
Author: Bindeshwar Singh Kushwaha – PostNetwork Academy
Goal
We want to compute:
$$ \frac{\partial L}{\partial z_j} $$
Notation:
- Logits: \(z = [z_1, z_2, \dots, z_C]\)
- Softmax: \(\hat{y}_i = \frac{e^{z_i}}{\sum_{k=1}^{C} e^{z_k}}\)
- Cross-Entropy Loss: \(L = -\sum_{i=1}^{C} y_i \log \hat{y}_i\), where \(y_i\) is one-hot.
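To make the notation concrete, here is a minimal NumPy sketch of the forward pass; the logits and one-hot target below are illustrative values, not from the text:

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, y_hat):
    # y is one-hot, so only the true-class term contributes to the sum.
    return -np.sum(y * np.log(y_hat))

z = np.array([2.0, 1.0, 0.1])   # example logits, C = 3
y = np.array([1.0, 0.0, 0.0])   # one-hot target
y_hat = softmax(z)
loss = cross_entropy(y, y_hat)
```

Because \(y\) is one-hot, the loss reduces to \(-\log \hat{y}_c\) for the true class \(c\).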
Loss derivative w.r.t. Softmax output
From the loss definition:
$$ \frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i} $$
Explanation: this follows from \(\frac{d}{dx}\log x = \frac{1}{x}\) together with the leading negative sign in the loss.
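This derivative can be checked numerically with central differences; the probabilities and target below are illustrative:

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])        # one-hot target
y_hat = np.array([0.2, 0.5, 0.3])    # some softmax output

def loss(y_hat):
    return -np.sum(y * np.log(y_hat))

# Analytic derivative: dL/d y_hat_i = -y_i / y_hat_i
analytic = -y / y_hat

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(y_hat)
for i in range(len(y_hat)):
    d = np.zeros_like(y_hat)
    d[i] = eps
    numeric[i] = (loss(y_hat + d) - loss(y_hat - d)) / (2 * eps)
```

Only the true-class coordinate has a nonzero derivative, exactly as the formula predicts.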
Softmax derivative as a Jacobian
Because each output \(\hat{y}_i\) depends on every logit \(z_j\), the derivative of softmax is not a vector but a \(C \times C\) Jacobian matrix:
$$
\frac{\partial \hat{y}}{\partial z} =
\begin{bmatrix}
\frac{\partial \hat{y}_1}{\partial z_1} & \dots & \frac{\partial \hat{y}_1}{\partial z_C} \\
\vdots & \ddots & \vdots \\
\frac{\partial \hat{y}_C}{\partial z_1} & \dots & \frac{\partial \hat{y}_C}{\partial z_C}
\end{bmatrix}
$$
- Diagonal elements (\(i=j\)) show self-influence.
- Off-diagonal (\(i\neq j\)) show competition across classes.
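The full Jacobian can be assembled in one expression as \(\operatorname{diag}(\hat{y}) - \hat{y}\hat{y}^\top\), which packages both the diagonal and off-diagonal cases; a NumPy sketch with illustrative logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)

# Full C x C Jacobian: J[i, j] = p_i * (delta_ij - p_j)
J = np.diag(p) - np.outer(p, p)
```

Each column sums to zero, reflecting that the softmax outputs always sum to 1 regardless of the logits.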
Case 1: \(i=j\)
$$
\frac{\partial \hat{y}_i}{\partial z_i}
= \frac{e^{z_i}\sum_{k} e^{z_k} - e^{z_i}\, e^{z_i}}{\left(\sum_{k} e^{z_k}\right)^2}
= \hat{y}_i (1 - \hat{y}_i)
$$
Case 2: \(i \neq j\)
Here the numerator \(e^{z_i}\) does not depend on \(z_j\), so the quotient rule gives:
$$
\frac{\partial \hat{y}_i}{\partial z_j}
= \frac{0 \cdot \sum_{k} e^{z_k} - e^{z_i}\, e^{z_j}}{\left(\sum_{k} e^{z_k}\right)^2}
= -\hat{y}_i \hat{y}_j
$$
Interpretation: Increasing \(z_j\) decreases \(\hat{y}_i\) for \(i \neq j\).
Compact Softmax Derivative
$$
\frac{\partial \hat{y}_i}{\partial z_j} =
\begin{cases}
\hat{y}_i(1-\hat{y}_i), & i=j \\
-\hat{y}_i \hat{y}_j, & i \neq j
\end{cases}
$$
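The compact piecewise formula can be verified against a finite-difference approximation of softmax itself; the logits below are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])
p = softmax(z)
C = len(z)

# Jacobian from the piecewise formula
J = np.empty((C, C))
for i in range(C):
    for j in range(C):
        J[i, j] = p[i] * (1 - p[i]) if i == j else -p[i] * p[j]

# Central finite differences: perturb one logit, watch every output
eps = 1e-6
J_num = np.empty((C, C))
for j in range(C):
    d = np.zeros(C)
    d[j] = eps
    J_num[:, j] = (softmax(z + d) - softmax(z - d)) / (2 * eps)
```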
Chain Rule
$$
\frac{\partial L}{\partial z_j} = \sum_{i=1}^{C} \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial z_j}
$$
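The chain-rule sum over \(i\) is a matrix-vector product of the Jacobian with the loss derivative from earlier; a sketch with illustrative values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0])
y = np.array([0.0, 1.0, 0.0])
p = softmax(z)

dL_dyhat = -y / p                    # dL/d y_hat_i from the loss derivative
J = np.diag(p) - np.outer(p, p)      # J[i, j] = d y_hat_i / d z_j
grad = J.T @ dL_dyhat                # chain rule: sum over i for each j
```

Carrying out this product by hand is exactly the simplification performed in the next section.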
Gradient Simplification
Substituting \(\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}\) and the two Jacobian cases:
$$
\frac{\partial L}{\partial z_j}
= -\frac{y_j}{\hat{y}_j}\,\hat{y}_j(1-\hat{y}_j)
+ \sum_{i \neq j} \left(-\frac{y_i}{\hat{y}_i}\right)\left(-\hat{y}_i \hat{y}_j\right)
= -y_j(1 - \hat{y}_j) + \sum_{i \neq j} y_i \hat{y}_j
$$
Since \(y\) is one-hot, \(\sum_{i \neq j} y_i = 1 - y_j\), hence
$$
\frac{\partial L}{\partial z_j}
= -y_j + y_j \hat{y}_j + \hat{y}_j (1 - y_j)
= \hat{y}_j - y_j
$$
Interpretation: the gradient equals the predicted probability minus the true label.
Final Result
$$
\boxed{\frac{\partial L}{\partial z} = \hat{y} - y}
$$
- The gradient is simply the predicted distribution minus the target.
- There is no need to form the full Jacobian in practice.
- This makes the backward pass cheap for classification tasks.
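As a final sanity check, the closed form \(\hat{y} - y\) agrees with a central-difference estimate of \(\partial L / \partial z\); the values below are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    # Cross-entropy of softmax(z) against one-hot y
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.5, -0.3, 0.8, 0.1])
y = np.array([0.0, 0.0, 1.0, 0.0])

grad_analytic = softmax(z) - y       # the boxed result

# Central-difference estimate of dL/dz, one logit at a time
eps = 1e-6
grad_numeric = np.array([
    (loss(z + eps * e, y) - loss(z - eps * e, y)) / (2 * eps)
    for e in np.eye(len(z))
])
```

This is why deep-learning frameworks backpropagate through softmax + cross-entropy as a single fused operation.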
About PostNetwork Academy
- Website: www.postnetwork.co
- YouTube: www.youtube.com/@postnetworkacademy
- Facebook: www.facebook.com/postnetworkacademy
- LinkedIn: www.linkedin.com/company/postnetworkacademy
- GitHub: www.github.com/postnetworkacademy
