    inputNode: I        hideNode: J         outputNode: K

    ith inputNode --w[i,0]^h--> jth hideNode --w[j,0]^o--> kth outputNode

    hidden layer:   z_0^h      = sum_i^I(w[i,0]^h * inputNode_i) + b_0^h
                    hideNode_0 = tanh(z_0^h)

    output layer:   z_0^o        = sum_j^J(w[j,0]^o * hideNode_j) + b_0^o
                    outputNode_0 = softmax(z_0^o)
1 Intro
The drawing above shows a network with a single hidden layer whose activation function is tanh, and an output layer whose activation function is softmax (the normalized exponential function).
The input layer has I nodes; the ith node is \(x_i\).
The hidden layer has J nodes, weights \(\theta_{ij}\), and activation function \(tanh^{hidden}(z_j)\), shortened to \(a^h(z_j)\), where \(z_j = \sum_i^I \theta_{ij} x_i + b_j \label{z_j} \tag{1}\).
The output layer has K nodes, weights \(\theta_{jk}\), and activation function \(softmax^{output}(z_k)\), shortened to \(a^o(z_k)\), where \(z_k = \sum_j^J \theta_{jk} a^h(z_j) + b_k \label{z_k} \tag{2}\).
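To make the drawing concrete, here is a minimal NumPy sketch of the forward pass described by \(\ref{z_j}\) and \(\ref{z_k}\); the array names (theta_ih, theta_ho, b_h, b_o) and the sizes I, J, K are only assumptions for the example, not anything fixed above.

```python
import numpy as np

I, J, K = 4, 5, 3                      # input, hidden, output node counts (arbitrary example sizes)

rng = np.random.default_rng(0)
x        = rng.normal(size=I)          # input vector x_i
theta_ih = rng.normal(size=(I, J))     # theta_ij: input -> hidden weights
b_h      = np.zeros(J)                 # b_j: hidden biases
theta_ho = rng.normal(size=(J, K))     # theta_jk: hidden -> output weights
b_o      = np.zeros(K)                 # b_k: output biases

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

# hidden layer: z_j = sum_i theta_ij x_i + b_j, a_j = tanh(z_j)        (equation 1)
z_h = x @ theta_ih + b_h
a_h = np.tanh(z_h)

# output layer: z_k = sum_j theta_jk a_j + b_k, a_k = softmax(z_k)     (equation 2)
z_o = a_h @ theta_ho + b_o
a_o = softmax(z_o)

print(a_o, a_o.sum())                  # softmax outputs sum to 1
```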
\(tanh(z_i)\) and its derivative (the expression involves only \(z_i\)):
let \(tanh(z_i) = \dfrac{1-e^{-2z_i}}{1 + e^{-2z_i}}\) then:
\[ \begin{align*} tanh'(z_i) &= 1 - tanh^2(z_i) \\ &= (1 + tanh(z_i))(1 - tanh(z_i)) \\ \end{align*} \]
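A quick numerical sanity check of the identity above against a finite-difference approximation (a sketch; the helper names are mine):

```python
import numpy as np

z = np.linspace(-3.0, 3.0, 7)
analytic = (1.0 + np.tanh(z)) * (1.0 - np.tanh(z))           # 1 - tanh^2(z)

eps = 1e-6
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)  # central difference

print(np.max(np.abs(analytic - numeric)))                    # ~1e-10
```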
\(softmax(z_i)\) and its partial derivatives (the expression involves not only \(z_i\) but also the other \(z_k\) in the denominator):
let \(softmax(z_i) = \dfrac{e^{z_i}}{\sum_k^K e^{z_k}}\) then:
\[ \left\{\begin{align*} \text{ if } i = j: \ \ \dfrac{\partial softmax(z_i)}{\partial z_i} &= \dfrac{e^{z_i} \sum_k^K e^{z_k} - e^{z_i}e^{z_i}}{\left(\sum_k^K e^{z_k}\right)^2} \\ &= softmax(z_i)\big(1 - softmax(z_i)\big) \\ &= a_i(1-a_i) \label{i_eq_j} \tag{3} \\ \\ \text{ if } i \neq j: \ \ \dfrac{\partial softmax(z_j)}{\partial z_i} &= \dfrac{0 - e^{z_j} e^{z_i}}{\left(\sum_k^K e^{z_k}\right)^2} \\ &= -softmax(z_i)softmax(z_j) \\ &= -a_ia_j \label{i_neq_j} \tag{4} \end{align*}\right. \]
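Equations \(\ref{i_eq_j}\) and \(\ref{i_neq_j}\) together say the softmax Jacobian is \(diag(a) - a a^T\). A small sketch that checks this closed form against finite differences (names and values assumed):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])
a = softmax(z)

# equations (3) and (4): Jac[j, i] = a_i (1 - a_i) if i == j else -a_i a_j
jac_analytic = np.diag(a) - np.outer(a, a)

eps = 1e-6
jac_numeric = np.empty((z.size, z.size))
for i in range(z.size):
    dz = np.zeros_like(z)
    dz[i] = eps
    jac_numeric[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)  # column i = d a / d z_i

print(np.max(np.abs(jac_analytic - jac_numeric)))   # ~1e-10
```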
2 SE and CE
SE (squared error) loss function:
\[ Loss = \sum_{k=1}^{K} \dfrac{1}{2} \left(a^{o}(z_k) - y_k\right)^2 \]
MSE (mean squared error): \(\dfrac{Loss}{K}\)
Using \(\ref{i_eq_j}\) and \(\ref{i_neq_j}\), the partial derivative with respect to \(z_i\) (the ith output node) is:
\[ \begin{align*} \dfrac{\partial Loss}{\partial z_i} &= \dfrac{\partial \sum_{k=1}^{K} \dfrac{1}{2} \left( a^{o}(z_k) - y_k\right )^2}{\partial z_i} \\ &= \sum_{k=1}^K \left(a^o(z_k) - y_k \right) \dfrac{\partial a^o(z_k)}{\partial z_i} \\ &= \sum_{k \neq i}^K \left(a^o(z_k) - y_k \right) \dfrac{\partial a^o(z_k)}{\partial z_i} + \left(a^o(z_i) - y_i \right) \dfrac{\partial a^o(z_i)}{\partial z_i} \\ &= \sum_{k \neq i}^K \left(a^o(z_k) - y_k \right) \left( -a^o(z_k)a^o(z_i) \right ) + \left(a^o(z_i) - y_i \right )a^o(z_i) \left( 1-a^o(z_i) \right ) \\ &\approx \left(a^o(z_i) - y_i \right )a^o(z_i) \left( 1-a^o(z_i) \right ) \label{se_z_k} \tag{5} \end{align*} \]
The approximation in the last line of \(\ref{se_z_k}\) holds because the softmax output is close to a one-hot encoding: if \(k \neq i\), then \(a^o(z_k) a^o(z_i) \approx 0\).
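A sketch that compares the exact SE gradient \(\sum_k (a^o(z_k) - y_k)\,\partial a^o(z_k)/\partial z_i\) with the approximation \(\ref{se_z_k}\); the logits and target here are arbitrary assumptions, chosen so the softmax output is near one-hot:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([4.0, -2.0, -3.0])                # peaked logits -> nearly one-hot output
a = softmax(z)
y = np.array([1.0, 0.0, 0.0])                  # one-hot target

# exact gradient of the SE loss w.r.t. z: sum_k (a_k - y_k) * d a_k / d z_i
jac = np.diag(a) - np.outer(a, a)              # softmax Jacobian, equations (3)/(4)
exact = jac.T @ (a - y)

# approximation (5): keep only the k = i term
approx = (a - y) * a * (1.0 - a)

print(exact)
print(approx)   # both tiny here; the approximation drops the k != i cross terms
```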
CE (cross-entropy) loss function:
\[ Loss = H\big(y, a^o(z)\big) = -\sum_{k=1}^K y_k \log a^o(z_k) \]
MCEE (mean cross-entropy error): \(\dfrac{Loss}{K}\)
The partial derivative with respect to \(z_i\), writing L for Loss and \(a_i\) for \(a^o(z_i)\):
\[ \begin{align*} \dfrac{\partial L}{\partial z_i} &= \sum_k^K \big(\frac{\partial L}{\partial a_k} \frac{\partial a_k}{\partial z_i}\big) \\ &= \sum_{k\neq i}^K \big(\frac{\partial L}{\partial a_k} \frac{\partial a_k}{\partial z_i}\big) + \sum_{k=i}^K \big(\frac{\partial L}{\partial a_k} \frac{\partial a_k}{\partial z_i}\big) \\ &= \sum_{k\neq i}^K \big( -y_k\frac{1}{a_k} \frac{\partial a_k}{\partial z_i} \big) + \big( -y_i\frac{1}{a_i} \frac{\partial a_i}{\partial z_i} \big) \\ &= \sum_{k\neq i}^K \big( -y_k\frac{1}{a_k} (-a_i a_k) \big) + \big( -y_i\frac{1}{a_i} a_i(1-a_i) \big) \\ &= \sum_{k\neq i}^K a_iy_k + \big(-y_i(1-a_i)\big) \\ &= \sum_{k\neq i}^K a_iy_k + a_iy_i -y_i \\ &= a_i\sum_k^K y_k - y_i \\ &= a_i - y_i \label{ce_z_k} \tag{6} \end{align*} \]
The last step uses \(\sum_k^K y_k = 1\) for a one-hot target. We call \(\frac{\partial L}{\partial z_i}\) above the intermediate signal: oSignal for the output layer and hSignal for the hidden layer. Note that the signal equation differs between SE and CE.
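A numerical check of \(\ref{ce_z_k}\): oSignal for the CE loss should equal \(a_i - y_i\). A minimal sketch with assumed names and values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.3, -0.7, 1.2])
y = np.array([0.0, 1.0, 0.0])                   # one-hot target

oSignal = softmax(z) - y                        # equation (6): dL/dz_i = a_i - y_i

# finite-difference check of the same gradient
eps = 1e-6
numeric = np.array([(ce_loss(z + eps * np.eye(3)[i], y)
                     - ce_loss(z - eps * np.eye(3)[i], y)) / (2 * eps)
                    for i in range(3)])

print(np.max(np.abs(oSignal - numeric)))        # ~1e-10
```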
3 Gradient
Look back at \(\ref{z_j}\) and \(\ref{z_k}\).
Below, the hidden-to-output weight gradient hoGrad is the partial derivative of the Loss with respect to \(\theta\): \(\text{hoGrad}_{jk}\) denotes the partial derivative of the Loss with respect to the weight from the jth hidden node to the kth output node:
\[ \text{hoGrad}_{jk} = \dfrac{\partial L}{\partial \theta_{jk}} = \dfrac{\partial L}{\partial z_k} \dfrac{\partial z_k}{\partial \theta_{jk}} = \dfrac{\partial L}{\partial z_k} a^h(z_j) = \text{oSignal}_k\ a^h(z_j) \]
and the bias gradient obGrad: \(\text{obGrad}_k = \dfrac{\partial L}{\partial b_k} = \text{oSignal}_k * 1\)
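In vectorized form, hoGrad is the outer product of the hidden activations with oSignal, and obGrad is just oSignal. A minimal sketch (array names and sizes are assumptions):

```python
import numpy as np

J, K = 5, 3
rng = np.random.default_rng(1)
a_h     = np.tanh(rng.normal(size=J))     # a^h(z_j), hidden activations
oSignal = rng.normal(size=K)              # dL/dz_k for each output node

hoGrad = np.outer(a_h, oSignal)           # hoGrad[j, k] = oSignal_k * a^h(z_j)
obGrad = oSignal * 1.0                    # obGrad_k = oSignal_k * 1

print(hoGrad.shape, obGrad.shape)         # (J, K) and (K,)
```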
4 Hidden
let \(a_k\) be short for \(a^o(z_k)\)
let \(a_j\) be short for \(a^h(z_j)\)
let oSignal be short for the list \(\dfrac{\partial Loss}{\partial z_k},\ k \in [1, K]\)
let hSignal be short for the list \(\dfrac{\partial Loss}{\partial z_j},\ j \in [1, J]\)
Keeping only the terms of \(\ref{z_j}\) and \(\ref{z_k}\) that involve the variable being differentiated, if \(z_k = \theta_{jk} a_j + b_k;\ a_j = tanh(z_j);\ z_j = \theta_{ij} x_i + b_j\), then: \[ \left\{\begin{matrix} \dfrac{\partial z_k}{\partial \theta_{jk}} = a_j; & \dfrac{\partial z_k}{\partial b_k} = 1 \\ \dfrac{\partial z_j}{\partial \theta_{ij}} = x_i; & \dfrac{\partial z_j}{\partial b_j} = 1 \\ \end{matrix}\right. \label{d_weight_bias} \tag{7} \]
and
\[ \begin{matrix} \dfrac{\partial z_k}{\partial a_j} = \theta_{jk}; & \dfrac{\partial z_j}{\partial x_i} = \theta_{ij} \\ \end{matrix} \label{d_a_x} \tag{8} \]
and
\[ \begin{align*} \dfrac{\partial a_j}{\partial z_j} &= \dfrac{\partial tanh(z_j)}{\partial z_j} \\ &= \big(1+tanh(z_j)\big)\big(1-tanh(z_j)\big) \\ &= (1+a_j)(1-a_j) \end{align*} \label{d_tanh} \tag{9} \]
From \(\ref{d_weight_bias}\), \(\ref{d_a_x}\), and \(\ref{d_tanh}\), we get the partial derivative of the Loss with respect to the jth hidden node \(z_j\) (every output node receives a weighted contribution from \(z_j\)):
\[ \begin{align*} \text{hSignal}_j = \dfrac{\partial Loss}{\partial z_j} &= \sum_{k=1}^{K} \dfrac{\partial Loss}{\partial z_k} \dfrac{\partial z_k}{\partial a_j} \dfrac{\partial a_j}{\partial z_j} \\ &= \sum_{k=1}^{K} \text{oSignal}_k \dfrac{\partial z_k}{\partial a_j} \dfrac{\partial a_j}{\partial z_j} \\ &= \sum_{k=1}^{K} \text{oSignal}_k \theta_{jk} \dfrac{\partial a_j}{\partial z_j} \\ &= \dfrac{\partial a_j}{\partial z_j} \sum_{k=1}^{K} \text{oSignal}_k \theta_{jk} \\ &= (1+a_j)(1-a_j) \sum_{k=1}^{K} \text{oSignal}_k \theta_{jk} \label{hsignal_z_j} \tag{10} \\ \end{align*} \]
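A minimal NumPy sketch of \(\ref{hsignal_z_j}\) (array names and sizes are assumptions):

```python
import numpy as np

J, K = 5, 3
rng = np.random.default_rng(2)
a_h      = np.tanh(rng.normal(size=J))        # a_j = tanh(z_j)
theta_ho = rng.normal(size=(J, K))            # theta_jk
oSignal  = rng.normal(size=K)                 # dL/dz_k

# equation (10): hSignal_j = (1 + a_j)(1 - a_j) * sum_k oSignal_k * theta_jk
hSignal = (1.0 + a_h) * (1.0 - a_h) * (theta_ho @ oSignal)

print(hSignal.shape)                          # (J,)
```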
For the partial derivative with respect to the weight into the jth hidden node:
\[ \begin{align*} \text{ihGrad}_{ij} &= \dfrac{\partial Loss}{\partial \theta_{ij}} \\ &= \dfrac{\partial Loss}{\partial z_j} \dfrac{\partial z_j}{\partial \theta_{ij}} \\ &= \text{hSignal}_j * x_i \end{align*} \]
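And the same in NumPy; hbGrad (the hidden bias gradient) is not named above, but it follows from \(\ref{d_weight_bias}\) the same way obGrad does (names assumed):

```python
import numpy as np

I, J = 4, 5
rng = np.random.default_rng(3)
x       = rng.normal(size=I)              # input x_i
hSignal = rng.normal(size=J)              # dL/dz_j from equation (10)

ihGrad = np.outer(x, hSignal)             # ihGrad[i, j] = hSignal_j * x_i
hbGrad = hSignal * 1.0                    # by (7): dz_j/db_j = 1, same pattern as obGrad

print(ihGrad.shape, hbGrad.shape)         # (I, J) and (J,)
```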