This post supplements 斋藤康毅's 《深度学习入门：基于Python的理论和实现》 (Deep Learning from Scratch). In Chapter 5 (误差反向传播法, error backpropagation), the author gives the formulas for the fully connected (Affine) layer and the Softmax layer but omits their derivations. The question this post answers is: how should the formulas below, as stated in the book, be understood?
Assume the fully connected layer computes $Y = X * W + B$. Then:

1. $\frac{\partial L}{\partial B} = \frac{\partial L}{\partial Y}$
2. $\frac{\partial L}{\partial W} = X^{T} * \frac{\partial L}{\partial Y}$
3. $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} * W^{T}$
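Before the derivations, here is a minimal NumPy sketch of the shapes involved. The concrete numbers are my own placeholders; the toy shapes (1, 2) and (2, 3) mirror the example used below:

```python
import numpy as np

X = np.array([[1.0, 2.0]])             # input, shape (1, 2)
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])        # weights, shape (2, 3)
B = np.array([[0.01, 0.02, 0.03]])     # bias, shape (1, 3)

Y = X @ W + B                          # forward pass, shape (1, 3)

dY = np.ones_like(Y)                   # stand-in for dL/dY from upstream
dB = dY                                # formula 1: dL/dB = dL/dY
dW = X.T @ dY                          # formula 2: dL/dW = X^T * dL/dY, shape (2, 3)
dX = dY @ W.T                          # formula 3: dL/dX = dL/dY * W^T, shape (1, 2)
```

Note how each gradient has the same shape as the object it differentiates with respect to; the derivations below show where the transposes come from.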
PS: This post is my own reading notes and can only serve as a supplement to the book; for the full treatment, please consult the book itself. Before reading on, you should have read the post 基于计算图的Softmax层反向传播推导 (a computational-graph derivation of Softmax backpropagation) and understood its conclusion that when a value branches out during forward propagation, the values flowing back along those branches are summed during backpropagation.

$\frac{\partial L}{\partial B} = \frac{\partial L}{\partial Y}$
As the computational graph in the figure shows, B and X * W are combined by an add node, and for an add node the downstream derivative equals the upstream derivative.
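This can be checked numerically. In the sketch below (my own; the toy loss $L = \sum Y$ is chosen purely so that $\frac{\partial L}{\partial Y}$ is a matrix of ones), a central-difference gradient with respect to B reproduces $\frac{\partial L}{\partial Y}$ exactly:

```python
import numpy as np

M = np.array([[0.9, 1.2, 1.5]])        # stands in for X @ W
B = np.array([[0.01, 0.02, 0.03]])

loss = lambda b: np.sum(M + b)         # toy scalar loss; dL/dY is all ones

eps = 1e-6
dB = np.zeros_like(B)
for idx in np.ndindex(B.shape):        # central difference per entry of B
    Bp, Bm = B.copy(), B.copy()
    Bp[idx] += eps
    Bm[idx] -= eps
    dB[idx] = (loss(Bp) - loss(Bm)) / (2 * eps)

print(dB)                              # ≈ [[1. 1. 1.]], identical to dL/dY
```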
$\frac{\partial L}{\partial W} = X^{T} * \frac{\partial L}{\partial Y}$
Take the matrix shapes from the figure's example: W is a (2, 3) matrix and X is a (1, 2) matrix. Let $M = X * W$, and write $w_{ij}$ for the weight by which $x_i$ contributes to $m_j$. The fully connected layer then computes:

$m_{j} = \sum_{i} {w_{ij} * x_{i}}$

It follows that each $w_{ij}$ appears once and only once in the computation of the output Y, namely inside $y_j$, so:

$\frac{\partial y_{j}}{\partial w_{ij}} = x_{i}$

Moreover, the derivative passed down from the upstream layer to $y_j$ is $\frac{\partial L}{\partial y_{j}}$, so by the chain rule $\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_{j}} * x_{i}$.
Using the figure's example, we construct the matrix U of derivatives of L with respect to the parameters W, so that the update rule $W = W - α * U$ can be applied. Arranging the entries $\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_{j}} * x_{i}$ in the same layout as W gives:

$$\frac{\partial L}{\partial W} = \left[ \begin{matrix} \frac{\partial L}{\partial y_{1}} * x_{1} & \frac{\partial L}{\partial y_{2}} * x_{1} & \frac{\partial L}{\partial y_{3}} * x_{1} \\ \frac{\partial L}{\partial y_{1}} * x_{2} & \frac{\partial L}{\partial y_{2}} * x_{2} & \frac{\partial L}{\partial y_{3}} * x_{2} \end{matrix} \right] = \left[ \begin{matrix} x_{1} \\ x_{2} \end{matrix} \right] * \left[ \begin{matrix} \frac{\partial L}{\partial y_{1}} & \frac{\partial L}{\partial y_{2}} & \frac{\partial L}{\partial y_{3}} \end{matrix} \right] = X^{T} * \frac{\partial L}{\partial Y}$$
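The result can again be verified against a numerical gradient (my own sketch; as before, the toy loss $L = \sum Y$ makes $\frac{\partial L}{\partial Y}$ all ones):

```python
import numpy as np

X = np.array([[1.0, 2.0]])
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])

loss = lambda w: np.sum(X @ w)         # toy scalar loss; dL/dY is all ones

eps = 1e-6
num_dW = np.zeros_like(W)
for idx in np.ndindex(W.shape):        # numerical dL/dw_ij by central difference
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    num_dW[idx] = (loss(Wp) - loss(Wm)) / (2 * eps)

dY = np.ones((1, 3))                   # dL/dY for this toy loss
print(np.allclose(num_dW, X.T @ dY))   # True: entry (i, j) equals x_i * dL/dy_j
```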
$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} * W^{T}$
During forward propagation $x_1$ branches out into all three outputs, so view L as the composite function $L(u(x_1, x_2), f(x_1, x_2), φ(x_1, x_2))$, where u, f, φ denote the outputs $y_1$, $y_2$, $y_3$. Taking $x_1$ as an example, summing the backpropagated values over the three branches, and noting that $\frac{\partial u}{\partial x_1} = w_{11}$, $\frac{\partial f}{\partial x_1} = w_{12}$, $\frac{\partial φ}{\partial x_1} = w_{13}$, we have:

$$\frac{\partial L}{\partial x_{1}} = \frac{\partial L}{\partial u} * \frac{\partial u}{\partial x_{1}} + \frac{\partial L}{\partial f} * \frac{\partial f}{\partial x_{1}} + \frac{\partial L}{\partial φ} * \frac{\partial φ}{\partial x_{1}} = w_{11} * \frac{\partial L}{\partial y_{1}} + w_{12} * \frac{\partial L}{\partial y_{2}} + w_{13} * \frac{\partial L}{\partial y_{3}} = \frac{\partial L}{\partial Y} * (w_{11}, w_{12}, w_{13})^{T}$$
That is:

$\frac{\partial L}{\partial x_{1}} = \frac{\partial L}{\partial Y} * (w_{11}, w_{12}, w_{13})^{T}$

$\frac{\partial L}{\partial x_{2}} = \frac{\partial L}{\partial Y} * (w_{21}, w_{22}, w_{23})^{T}$
Therefore, placing the two column vectors side by side, which yields exactly the (3, 2) matrix $W^{T}$:

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} * \left[ \begin{matrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ w_{13} & w_{23} \end{matrix} \right] = \frac{\partial L}{\partial Y} * W^{T}$$
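Putting the three formulas together yields a complete layer. The sketch below is in the spirit of the book's Affine layer (my own assembly from the results above, not a verbatim copy of the book's code); the only generalization is the batch-summed bias gradient, since with N input rows B is broadcast to every row in the forward pass, so the N backward branches are summed, the same branches-add rule invoked at the start:

```python
import numpy as np

class Affine:
    """Fully connected layer built from the three formulas derived above."""

    def __init__(self, W, b):
        self.W = W                     # shape (in_dim, out_dim)
        self.b = b                     # shape (out_dim,)
        self.x = None                  # input cached for the backward pass
        self.dW = None
        self.db = None

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b     # Y = X * W + B

    def backward(self, dout):          # dout is dL/dY from the upstream layer
        dx = dout @ self.W.T           # dL/dX = dL/dY * W^T
        self.dW = self.x.T @ dout      # dL/dW = X^T * dL/dY
        self.db = np.sum(dout, axis=0) # dL/dB = dL/dY, summed over the batch
        return dx
```

For a single sample (batch size 1), the column-wise sum leaves dout unchanged, which is exactly $\frac{\partial L}{\partial B} = \frac{\partial L}{\partial Y}$.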