[Source Code] Convolutional Two-Stream Network Fusion for Video Action Recognition

mac2022-06-30 105


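Contents
- Environment setup
- Running the code
- Reading the code
  - Dependencies
  - Directory structure
  - cnn_ucf101_spatial: (1) input layer; (2) last fully connected layer; (3) setting loss and derOutputs; (4) dropout, top1error and top5error
  - cnn_ucf101_temporal
  - cnn_ucf101_fusion: finding the layers to fuse; (1) spatial fusion: adding the Concat fusion layer; (2) temporal fusion: Conv3D; (3) temporal fusion: Pool3D; setting the output derivatives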

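This post runs and analyzes the two-stream fusion code. It covers environment configuration, dataset and model preparation, and a walkthrough of the code. For a summary of the paper itself, see the companion post: https://blog.csdn.net/u013588351/article/details/102074562

Related links
Overview: http://www.robots.ox.ac.uk/~vgg/software/two_stream_action/
Code: https://github.com/feichtenhofer/twostreamfusion
Data: http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/data/
Data (Baidu Cloud): https://pan.baidu.com/s/1Veq9a0n2S_2lebbo8OFZUw extraction code: eed7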

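Environment setup

Download the code: https://github.com/feichtenhofer/twostreamfusion

Compile matconvnet

1. Install a C++ build environment: either Visual Studio (recommended) or MinGW.
2. Configure mex in MATLAB to use the C++ compiler.
3. Run compile.m.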

Errors when compiling with GPU support:

Error using vl_nnconv: An input is not a numeric array (or GPU support not compiled).
Fix: in vl_compilenn.m, line 502, change
    mopts = {'-outdir', fileparts(tgt), src, '-c', mex_opts{:}} ;
to
    mopts = {'-outdir', fileparts(tgt), src, '-c', mex_opts{:}, '-largeArrayDims'} ;

Errors when compiling for CPU:

'cl.exe' is not recognized as an internal or external command, operable program or batch file. Warning: CL.EXE not found in PATH. Trying to guess out of mex setup.
Fix: search for cl.exe under the Visual Studio installation directory, add its directory to the PATH environment variable, and restart MATLAB.

Error using mex: data.cpp error C2027: use of undefined type 'vl::CudaHelper' note: see declaration of 'vl::CudaHelper' error C2228: left of '.getLastCudnnErrorMessage' must have class/struct/union
Fix: comment out the original vl_compilenn call in compile.m and compile the CPU version instead.

Error using make_all>search_cuda_devkit (line 442): Could not find a valid NVCC executable\n
Fix: change line 19 of MexConv3D/make_all.m to opts.enableGpu = false;

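Running the code

The steps below use cnn_ucf101_spatial as the example; cnn_ucf101_temporal is run the same way.

Fake the dataset: comment out three lines and add two lines.

Download an ImageNet pretrained model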

Model     Download (file name)                Size     Year proposed
vgg-m     imagenet-vgg-m-2048.mat             329M     2013
vgg-16    imagenet-vgg-verydeep-16.mat        491M     2014
res-50    imagenet-resnet-50-dag.mat          91.5M    2015
res101    imagenet-resnet-101-dag.mat         159M     2015
res152    imagenet-resnet-152-dag.mat         215M     2015

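The default is res-50. After downloading, put the file in the models directory and change the value of model to match. Be sure to download from the links above rather than taking the latest models from the official site; otherwise net = dagnn.DagNN.loadobj(net) later fails with an error saying that class dagnn.Conv has no public field dilate.

At this point cnn_ucf101_spatial should run normally, although it cannot actually train yet.

Use F12 to set a breakpoint, F5 to start debugging, and F10 to step. Variable values are shown in the Workspace pane at the lower left, and the Command Window at the bottom can execute commands live while debugging.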

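Download the pretrained spatial and temporal networks

To run the fusion network cnn_ucf101_fusion, first download the pretrained spatial and temporal networks. Pick one of res50, res152, or vgg16. vgg16 is the best choice, because the layer names in the other models differ and much of the later code would have to be changed. Put the downloaded files in the models directory and update the file names in the code accordingly:

    opts.modelA = fullfile(opts.modelPath, [opts.dataSet '-img-vgg16-split' num2str(opts.nSplit) '.mat']) ;
    opts.modelB = fullfile(opts.modelPath, [opts.dataSet '-TVL1flow-vgg16-split' num2str(opts.nSplit) '.mat']) ;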
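To check which files the two lines above expect, the path construction can be run on its own. This is a minimal sketch: the values of modelPath and nSplit are assumptions for illustration, and only the file-name pattern comes from the code above.

```matlab
% Assumed values for illustration; in the script they come from opts.
modelPath = 'models';
dataSet   = 'ucf101';
nSplit    = 1;

% Same string concatenation as in cnn_ucf101_fusion.
modelA = fullfile(modelPath, [dataSet '-img-vgg16-split' num2str(nSplit) '.mat']);
modelB = fullfile(modelPath, [dataSet '-TVL1flow-vgg16-split' num2str(nSplit) '.mat']);

% Strip the directory so only the expected file names are printed.
[~, nameA, extA] = fileparts(modelA);
[~, nameB, extB] = fileparts(modelB);
fprintf('spatial model:  %s%s\n', nameA, extA);
fprintf('temporal model: %s%s\n', nameB, extB);
```

If the downloaded files are named differently, renaming them to match is likely simpler than editing the pattern.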

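Reading the code

Dependencies

Directory structure

Entry-point files:
cnn_ucf101_spatial: trains the spatial network
cnn_ucf101_temporal: trains the temporal network
cnn_ucf101_fusion: loads the trained models and trains the final fusion network

Three files that fetch data:
cnn_ucf101_get_frame_batch: fetches single RGB frames
cnn_ucf101_get_flow_batch: fetches optical-flow images
cnn_ucf101_get_im_flow_batch: fetches RGB images plus optical-flow images

Other files:
cnn_setup_environment: initializes environment variables
cnn_ucf101_setup_data: initializes the data
cnn_train_dag: trains the CNN, independent of the specific model
compile: compiles the matconvnet library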

cnn_ucf101_spatial

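The spatial network is simply a standard CNN. Its input is a 224x224 RGB image and its output is a classification over the UCF-101 dataset, so the input dimension is [224 224 3] and the output dimension is [101 1]. The architecture in between can be any of the common ones: vgg-m, vgg16, res50, res101, res152, and so on.

The paper uses networks pretrained on ImageNet, which has several advantages: (1) the effectiveness of the architecture has been verified by extensive experiments; (2) after pretraining, the hidden layers can already extract complex image features; (3) the hidden-layer structure does not have to be defined again, since loading a ready-made model is enough.

A few light modifications to the pretrained network yield the spatial network. To make the input and output match the experimental data, we need to modify (1) the weights and biases of the input layer and (2) the weights and biases of the last fully connected layer. For training, we need to (3) set our own objective function, loss, and derOutputs. For the experiments, we add (4) dropout, top1error, and top5error. Parameters such as numEpochs, epochFactor, learningRate, and batchSize also need to be set.

About DagNN and SimpleNN: see the official DagNN documentation and the official SimpleNN documentation; a separate article covers some of the differences between the two. net may start out as either a DagNN or a SimpleNN. If it is a SimpleNN, it is converted to a DagNN with dagnn.DagNN.fromSimpleNN(net), and training is finally done through cnn_train_dag.

Main parameters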

    opts.dataSet = 'ucf101';
    opts.dropOutRatio = 0 ;
    opts.inputdim = [ 224, 224, 3] ;
    opts.train.batchSize = 256 ;
    opts.train.augmentation = 'borders25';
    opts.train.learningRate = [1e-2*ones(1, 2) 1e-2*ones(1, 3) 1e-3*ones(1, 3) 1e-4*ones(1, 3)] ;
    imdb = load(opts.imdbPath) ;           % image database
    nClasses = length(imdb.classes.name);  % number of classes in the dataset
    net = load(opts.model);                % pretrained network
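The learning-rate vector above is easier to read when expanded. The sketch below assumes that the trainer uses one entry per epoch, which is how MatConvNet's example training scripts conventionally index a learning-rate vector; that reading is an assumption here, not something this post confirms for cnn_train_dag.

```matlab
% Learning-rate vector copied from the parameters above.
learningRate = [1e-2*ones(1, 2) 1e-2*ones(1, 3) 1e-3*ones(1, 3) 1e-4*ones(1, 3)];

% Assumption: one entry per epoch, so the vector length is the epoch count.
numEpochs = numel(learningRate);
for epoch = 1:numEpochs
    fprintf('epoch %2d: learning rate %g\n', epoch, learningRate(epoch));
end
```

Under that assumption the schedule is five epochs at 1e-2, then three at 1e-3, then three at 1e-4.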

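(1) Input layer

Nothing needs to change here: the pretrained model already takes [224 224 3] input, which matches our data.

(2) Last fully connected layer

Modify the two parameters of the last fully connected layer: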

fc_filter: dimensions change from [1 1 2048 1000] to [1 1 2048 101], randomly initialized
fc_bias: dimensions change from [1000 1] to [101 1], initialized to 0

    % ImageNet has 1000 classes, so the classifier outputs a (1000, 1) vector.
    % Resize the weights and biases of the last fully connected layer so the output is (101, 1).
    % replace 1000-way imagenet classifiers
    for p = 1 : numel(net.params)
      sz = size(net.params(p).value);
      disp(sz);
      if any(sz == 1000)
        sz(sz == 1000) = nClasses;
        fprintf('replace classifier layer of %s\n', net.params(p).name);
        % fc1000_filter: [1 1 2048 1000] -> [1 1 2048 101]
        if numel(sz) > 2
          net.params(p).value = 0.01 * randn(sz, class(net.params(p).value));
        % fc1000_bias: [1000 1] -> [101 1]
        else
          net.params(p).value = zeros(sz, class(net.params(p).value));
        end
      end
    end
    % sets normalization.border = [32 32]
    net.meta.normalization.border = [256 256] - net.meta.normalization.imageSize(1:2);
    net = dagnn.DagNN.loadobj(net);
    if strfind(model, 'bnorm')
      net = insert_bnorm_layers(net) ;
    end
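The resizing logic can be tried without matconvnet by running the same loop on a mock parameter list. In this sketch the struct array params stands in for net.params; the two entries and their names are assumptions modeled on the res-50 shapes quoted above.

```matlab
nClasses = 101;

% Mock of net.params: a 1x2 struct array with the shapes quoted in the text.
params = struct('name',  {'fc1000_filter', 'fc1000_bias'}, ...
                'value', {zeros(1, 1, 2048, 1000, 'single'), zeros(1000, 1, 'single')});

for p = 1:numel(params)
    sz = size(params(p).value);
    if any(sz == 1000)
        sz(sz == 1000) = nClasses;      % swap the 1000-way dimension for 101
        if numel(sz) > 2
            % filter bank: small random values, same numeric class as before
            params(p).value = 0.01 * randn(sz, class(params(p).value));
        else
            % bias vector: zeros
            params(p).value = zeros(sz, class(params(p).value));
        end
    end
end

% Border used for cropping: 256x256 resized frames minus the 224x224 input.
border = [256 256] - [224 224];
disp(size(params(1).value));
disp(size(params(2).value));
disp(border);
```

Note that the loop matches any parameter with a dimension equal to 1000, so it relies on no other layer in the network having that size.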

(3) Setting loss and derOutputs

    % Remove the SoftMax layer, add a Loss layer, and register the loss in derOutputs for backpropagation
    opts.train.derOutputs = {} ;
    for l=numel(net.layers):-1:1
      if isa(net.layers(l).block, 'dagnn.Loss') && isempty(strfind(net.layers(l).name, 'err'))
        opts.train.derOutputs = {opts.train.derOutputs{:}, net.layers(l).outputs{:}, 1} ;
      end
      % Remove the SoftMax layer
      if isa(net.layers(l).block, 'dagnn.SoftMax')
        net.removeLayer(net.layers(l).name)
        l = l - 1;
      end
    end
    if isempty(opts.train.derOutputs)
      % Add a Loss layer
      net = dagnn.DagNN.insertLossLayers(net, 'numClasses', nClasses) ;
      fprintf('setting derivative for layer %s \n', net.layers(end).name);
      % Register the model output used for backpropagation
      opts.train.derOutputs = {opts.train.derOutputs{:}, net.layers(end).outputs{:}, 1} ;
    end

(4) dropout, top1error and top5error

    % Configure dropout layers according to opts.dropOutRatio
    if ~isnan(opts.dropOutRatio)
      dr_layers = find(arrayfun(@(x) isa(x.block,'dagnn.DropOut'), net.layers)) ;
      % Update existing dropout layers
      if ~isempty(dr_layers)
        if opts.dropOutRatio > 0
          for i=dr_layers, net.layers(i).block.rate = opts.dropOutRatio; end
        else
          net.removeLayer({net.layers(dr_layers).name});
        end
      else
        % No dropout layer found in net: add dropout after the last pooling layer
        if opts.dropOutRatio > 0
          pool5_layer = find(arrayfun(@(x) isa(x.block,'dagnn.Pooling'), net.layers)) ;
          conv_layers = pool5_layer(end);
          for i=conv_layers
            block = dagnn.DropOut() ;
            block.rate = opts.dropOutRatio ;
            newName = ['drop_' net.layers(i).name];
            net.addLayer(newName, ...
              block, ...
              net.layers(i).outputs, ...
              {newName}) ;
            for l = 1:numel(net.layers)-1
              for f = net.layers(i).outputs
                sel = find(strcmp(f, net.layers(l).inputs )) ;
                if ~isempty(sel)
                  [net.layers(l).inputs{sel}] = deal(newName) ;
                end
              end
            end
          end
        end
      end
    end
    % Add two layers after the loss to compute the error rates
    lossLayers = find(arrayfun(@(x) isa(x.block,'dagnn.Loss') && strcmp(x.block.loss,'softmaxlog'),net.layers));
    net.addLayer('top1error', ...
      dagnn.Loss('loss', 'classerror'), ...
      net.layers(lossLayers(end)).inputs, ...
      'top1error') ;
    net.addLayer('top5error', ...
      dagnn.Loss('loss', 'topkerror', 'opts', {'topK', 5}), ...
      net.layers(lossLayers(end)).inputs, ...
      'top5error') ;
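For readers unfamiliar with the two error layers, the quantities they report can be sketched on a toy score matrix. This is an illustration of the usual top-1 and top-k error definitions, not matconvnet's implementation; the scores, the labels, and the choice k = 2 (instead of 5, to fit a three-class toy example) are all made up.

```matlab
% Toy scores: one column per sample, one row per class.
scores = [0.1 0.7; ...
          0.6 0.1; ...
          0.3 0.2];
labels = [3 1];      % ground-truth class of each sample
k = 2;               % top-k with k = 2 for this 3-class toy example

% Rank the classes of every sample from highest to lowest score.
[~, order] = sort(scores, 1, 'descend');

% Top-1 error: fraction of samples whose best class is not the label.
top1error = mean(order(1, :) ~= labels);

% Top-k error: fraction of samples whose label is not among the k best classes.
inTopK = any(bsxfun(@eq, order(1:k, :), labels), 1);
topkerror = mean(~inTopK);

fprintf('top-1 error: %.2f, top-%d error: %.2f\n', top1error, k, topkerror);
```

In this example the first sample is ranked second-best for its true class, so it counts as a top-1 error but not as a top-2 error.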

Other training parameters

    opts.train.train = find(ismember(imdb.images.set, [1])) ;
    opts.train.train = repmat(opts.train.train,1,opts.train.epochFactor);
    opts.train.valmode = '250samples';
    opts.train.denseEval = 1;

opts.train.train is an index array that specifies which samples take part in training.
All training samples are repeated opts.train.epochFactor times, so every sample is used several times per epoch.
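The effect of the two index lines can be seen on a tiny example. The set labels and the epochFactor value below are made up; the convention that 1 marks training samples follows the ismember(..., [1]) call above.

```matlab
% Hypothetical split labels for six samples: 1 = training, 2 = other split.
imagesSet   = [1 1 2 1 2 1];
epochFactor = 3;     % hypothetical value

% Indices of the training samples.
train = find(ismember(imagesSet, 1));

% Repeat the index list so that one "epoch" visits every sample epochFactor times.
train = repmat(train, 1, epochFactor);
disp(train);
```

The repeated list is what the trainer then shuffles and batches, so one reported epoch covers epochFactor passes over the training split.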

cnn_ucf101_temporal

This differs little from the spatial network. The main change is that the input dimension goes from [224, 224, 3] to [224, 224, 20], so the weights of the input layer must grow to match:

    opts.inputdim = [netNorm.imageSize(1:2), 20] ;
    net.layers{1}.weights{1} = repmat(mean(net.layers{1}.weights{1},3), [1 1 opts.inputdim(3) 1]) ;
    net.meta.normalization.averageImage = [];
    net.meta.normalization.border = [256 256] - netNorm.imageSize(1:2);
    net = replace_last_layer(net, [1 2], [1 2], nClasses, opts.dropOutRatio);
    net.normalization.imageSize = opts.inputdim ;
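The repmat(mean(...)) line is the key step: the pretrained RGB filters are averaged over their three input channels and that average is copied to all 20 flow channels. A minimal sketch with random filters follows; the 3x3x3x64 filter shape is an assumption chosen for illustration.

```matlab
% Hypothetical pretrained first-layer filters: height x width x 3 channels x 64 filters.
wRGB  = randn(3, 3, 3, 64, 'single');
nFlow = 20;          % number of input channels of the temporal network

% Average over the RGB channel dimension, then replicate along that dimension.
wMean = mean(wRGB, 3);                      % 3 x 3 x 1 x 64
wFlow = repmat(wMean, [1 1 nFlow 1]);       % 3 x 3 x 20 x 64

disp(size(wFlow));
```

Every flow channel therefore starts from the same filter, and training is what differentiates them afterwards.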

cnn_ucf101_fusion

Main parameters

    addConv3D = 1 ;
    addPool3D = 1 ;
    doSum = 0 ;
    imdb = load(opts.imdbPath) ;
    netA = load(opts.modelA) ;
    netB = load(opts.modelB) ;

Finding the layers to fuse

    opts.train.fusionType = 'conv';
    opts.train.fusionLayer = {'relu5_3', 'relu5_3'; };
    fusionLayerA = [];
    fusionLayerB = [];
    if ~isempty(opts.train.fusionLayer)
      for i=1:numel(netA.layers)
        if isfield(netA.layers(i),'name') && any(strcmp(netA.layers(i).name,opts.train.fusionLayer(:,1)))
          fusionLayerA = [fusionLayerA i];
        end
      end
      for i=1:numel(netB.layers)
        if isfield(netB.layers(i),'name') && any(strcmp(netB.layers(i).name,opts.train.fusionLayer(:,2)))
          fusionLayerB = [fusionLayerB i];
        end
      end
    end

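netA and netB have the same structure. In both, the last ReLU layer before the fully connected layers is named relu5_3, and the fusion happens at this layer. The fusion itself is done next.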
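The name lookup can be reproduced on a mock layer list. In the sketch below the struct array layers stands in for netA.layers, and its four entries are an assumed excerpt of a vgg16-style network.

```matlab
% Mock of netA.layers: only the name field matters for the lookup.
layers = struct('name', {'conv5_3', 'relu5_3', 'pool5', 'fc6'});

% One row per fusion point: {layer in network A, layer in network B}.
fusionLayer = {'relu5_3', 'relu5_3'};

% Loop form, as in the script.
fusionLayerA = [];
for i = 1:numel(layers)
    if any(strcmp(layers(i).name, fusionLayer(:, 1)))
        fusionLayerA = [fusionLayerA i];
    end
end

% Equivalent vectorized lookup.
idx = find(ismember({layers.name}, fusionLayer(:, 1)));

fprintf('fusion layer index in A: %d\n', fusionLayerA);
```

Adding more rows to fusionLayer would yield several indices, which is why the script accumulates them in an array.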

(1) Spatial fusion: adding the Concat fusion layer

    for i = 1:size(opts.train.fusionLayer,1)
      if strcmp(opts.train.fuseInto,'spatial')
        i_fusion = find(~cellfun('isempty', strfind({net.layers.name}, ...
          [opts.train.fusionLayer{i,1} '_' opts.train.fuseInto])));
      else
        i_fusion = find(~cellfun('isempty', strfind({net.layers.name}, ...
          [opts.train.fusionLayer{i,2} '_' opts.train.fuseInto])));
      end
      name_concat = [opts.train.fusionLayer{i,2} '_concat'];
      if doSum
        block = dagnn.Sum() ;
        net.addLayerAt(i_fusion(end), name_concat, block, ...
          [net.layers(strcmp({net.layers.name},[opts.train.fusionLayer{i,1} '_spatial'])).outputs ...
           net.layers(strcmp({net.layers.name},[opts.train.fusionLayer{i,2} '_temporal'])).outputs], ...
          name_concat) ;
      else
        block = dagnn.Concat() ;
        net.addLayerAt(i_fusion(end), name_concat, block, ...
          [net.layers(strcmp({net.layers.name},[opts.train.fusionLayer{i,1} '_spatial'])).outputs ...
           net.layers(strcmp({net.layers.name},[opts.train.fusionLayer{i,2} '_temporal'])).outputs], ...
          name_concat) ;
      end
      % set input for fusion layer
      net.layers(i_fusion(end)+2).inputs{1} = name_concat;
    end

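Key code (simplified):

    % Add the fusion layer
    net.addLayerAt( i_fusion(end), ...   % position of the fusion layer (seems to be just the layer index, with no real effect)
      name_concat, ...                   % name of the fusion layer (relu5_3_concat)
      block, ...                         % type of the fusion layer (dagnn.Sum() or dagnn.Concat())
      {'relu5_3_spatial', ...            % fusion layer input 1
       'relu5_3_temporal'}, ...          % fusion layer input 2
      name_concat ...                    % output of the fusion layer (relu5_3_concat)
    );
    % Make the fusion layer the input of pool5_spatial
    net.layers(i_fusion(end)+2).inputs{1} = name_concat;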

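The code above fuses the two networks by inserting a concat layer between relu5 and pool5. This is the fusion point marked with a red box in the original post's architecture figure.

Network structure before fusion: (figure)
Network structure after fusion: (figure)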
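The difference between the two fusion blocks shows up in the output shapes. The sketch below uses plain arrays in place of the relu5_3 outputs; the 14x14x512 size is what vgg16 produces at relu5_3 for a 224x224 input, and concatenating along the third (channel) dimension is an assumption about how dagnn.Concat is configured here.

```matlab
% Stand-ins for the relu5_3 feature maps of the two streams.
xSpatial  = rand(14, 14, 512, 'single');
xTemporal = rand(14, 14, 512, 'single');

% Concat fusion: stack the channels, so the next layer sees 1024 channels.
yConcat = cat(3, xSpatial, xTemporal);

% Sum fusion: add the maps element-wise, so the channel count stays at 512.
ySum = xSpatial + xTemporal;

disp(size(yConcat));
disp(size(ySum));
```

With concat fusion, whatever layer follows must accept twice as many input channels, which is the job of the Conv3D layer added in the next step.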

(2) Temporal fusion: Conv3D

    if addConv3D
      block = dagnn.Conv3D() ;
      params(1).name = 'conv3Df' ;
      in = size(net.params(net.getParamIndex('conv5_3f_spatial')).value,4) + ...
        size(net.params(net.getParamIndex('conv5_3f_temporal')).value,4) ;
      out = 512;
      kernel = eye(in/2,out,'single');
      kernel = cat(1, .25 * kernel, .75 * kernel);
      kernel = permute(kernel, [4 5 3 1 2]);
      sigma = 1;
      [X,Y,Z] = ndgrid(-1:1, -1:1, -1:1);
      G3 = exp( -((X.*X)/(sigma*sigma) + (Y.*Y)/(sigma*sigma) + (Z.*Z)/(sigma*sigma))/2 );
      G3 = G3./sum(G3(:));
      kernel = bsxfun(@times, kernel, G3);
      params(1).value = kernel;
      params(2).name = 'conv3Db' ;
      params(2).value = zeros(1, out ,'single') ;
      pads = size(kernel);
      pads = ceil(pads(1:3) / 2) - 1;
      block.pad = [pads(1),pads(1), pads(2),pads(2), pads(3),pads(3)];
      block.stride = [1 1 1];
      block.size = size(kernel);
      i_relu5 = find(~cellfun('isempty', strfind({net.layers.name},'relu5_3_concat')));
      net.addLayerAt(i_relu5, 'conv53D', block, ...
        [net.layers(i_relu5).outputs ], ...
        'conv3D5', {params.name}) ;
      net.params(net.getParamIndex(params(1).name)).value = params(1).value ;
      net.params(net.getParamIndex(params(2).name)).value = params(2).value ;
      block = dagnn.ReLU() ;
      net.addLayerAt(i_relu5+1, 'relu3D5', block, ...
        [net.layers(i_relu5+1).outputs ], ...
        'relu3D5') ;
      net.layers(find(~cellfun('isempty', strfind({net.layers.name},['pool5_' opts.train.fuseInto])))).inputs = {'relu3D5'};
    end

This adds a conv3D5 layer and a relu3D5 layer after relu5_3_concat and then connects them to pool5.
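The kernel initialization in that block is worth unpacking. The sketch below repeats the same array operations with 4 channels per stream instead of 512 (an illustrative reduction) and prints what the resulting filter bank looks like; nothing here depends on matconvnet.

```matlab
nChannels = 4;                 % channels per stream (512 in the script; small here)
in  = 2 * nChannels;           % concatenated input channels
out = nChannels;               % output channels

% Output channel k reads spatial channel k with weight .25
% and temporal channel k with weight .75.
kernel = eye(in/2, out, 'single');
kernel = cat(1, .25 * kernel, .75 * kernel);   % in x out
kernel = permute(kernel, [4 5 3 1 2]);         % 1 x 1 x 1 x in x out

% Normalized 3x3x3 Gaussian that spreads each weight over space and time.
sigma = 1;
[X, Y, Z] = ndgrid(-1:1, -1:1, -1:1);
G3 = exp(-(X.^2 + Y.^2 + Z.^2) / (2 * sigma^2));
G3 = G3 ./ sum(G3(:));
kernel = bsxfun(@times, kernel, G3);           % 3 x 3 x 3 x in x out

% "Same" padding for a 3x3x3 kernel: one voxel on each side.
pads = size(kernel);
pads = ceil(pads(1:3) / 2) - 1;

wSpatial  = kernel(:, :, :, 1, 1);               % spatial channel 1 -> output 1
wTemporal = kernel(:, :, :, 1 + nChannels, 1);   % temporal channel 1 -> output 1
wCross    = kernel(:, :, :, 2, 1);               % spatial channel 2 -> output 1
fprintf('sum spatial %.2f, sum temporal %.2f, sum cross %.2f\n', ...
    sum(wSpatial(:)), sum(wTemporal(:)), sum(wCross(:)));
```

So at initialization the layer computes a Gaussian-smoothed weighted average of matching spatial and temporal channels (weights .25 and .75) and ignores all other channel pairs.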

(3) Temporal fusion: Pool3D

    if addPool3D
      block = dagnn.Pooling3D() ;
      block.method = 'max' ;
      i_pool5 = find(~cellfun('isempty', strfind({net.layers.name},['pool5_' opts.train.fuseInto])));
      block.poolSize = [net.layers(i_pool5).block.poolSize nFrames];
      block.pad = [net.layers(i_pool5).block.pad 0,0];
      block.stride = [net.layers(i_pool5).block.stride 2];
      net.addLayerAt(i_pool5, ['pool3D5_' opts.train.fuseInto], block, ...
        [net.layers(i_pool5).inputs], ...
        [net.layers(i_pool5).outputs]) ;
      net.removeLayer(['pool5_' opts.train.fuseInto], 0) ;
      i_pool5 = find(~cellfun('isempty', strfind({net.layers.name},['pool5_' opts.train.fuseFrom ])));
      if ~isempty(i_pool5)
        block = dagnn.Pooling3D() ;
        block.poolSize = [net.layers(i_pool5).block.poolSize nFrames];
        block.pad = [net.layers(i_pool5).block.pad 0,0];
        block.stride = [net.layers(i_pool5).block.stride 2];
        net.addLayerAt(i_pool5, ['pool3D5_' opts.train.fuseFrom], block, ...
          [net.layers(i_pool5).inputs], ...
          [net.layers(i_pool5).outputs]) ;
        net.removeLayer(['pool5_' opts.train.fuseFrom ], 0) ;
      end
    end

This replaces pool5 with pool3D5.
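How the 3D pooling parameters are derived from the 2D ones can be sketched with plain structs. The pool5 settings below (2x2 window, stride 2, no padding) are the standard vgg16 values; nFrames = 5 is a made-up value, and the output-length formula is the usual pooling arithmetic rather than something taken from the MexConv3D source.

```matlab
% Assumed 2D pool5 settings of vgg16.
pool2D = struct('poolSize', [2 2], 'pad', [0 0 0 0], 'stride', [2 2]);
nFrames = 5;         % hypothetical number of frames per sample

% Extend each 2D setting with a temporal component, as in the script.
pool3D.poolSize = [pool2D.poolSize nFrames];   % pool over all frames at once
pool3D.pad      = [pool2D.pad 0, 0];           % no temporal padding
pool3D.stride   = [pool2D.stride 2];

% Usual pooling arithmetic along time: floor((T + pad - window) / stride) + 1.
T = nFrames;
tOut = floor((T + sum(pool3D.pad(5:6)) - pool3D.poolSize(3)) / pool3D.stride(3)) + 1;
fprintf('temporal length after pool3D5: %d\n', tOut);
```

Because the temporal window equals the number of frames, the whole clip collapses to a single time step before the fully connected layers.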

Setting the output derivatives

    opts.train.derOutputs = {} ;
    for l=1:numel(net.layers)
      if isa(net.layers(l).block, 'dagnn.Loss') && isempty(strfind(net.layers(l).block.loss, 'err'))
        if opts.backpropFuseFrom || ~isempty(strfind(net.layers(l).name, opts.train.fuseInto ))
          fprintf('setting derivative for layer %s \n', net.layers(l).name);
          opts.train.derOutputs = [opts.train.derOutputs, net.layers(l).outputs, {1}] ;
        end
        net.addLayer(['err1_' net.layers(l).name(end-7:end) ], dagnn.Loss('loss', 'classerror'), ...
          net.layers(l).inputs, 'error') ;
      end
    end

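The fused network has two loss outputs, loss_spatial and loss_temporal. Which one is used for backpropagation? First, the whole fusion network fuses the temporal network into the spatial network, so the spatial loss loss_spatial must be used for backpropagation; otherwise the fusion would not be trained effectively.

This raises two more questions. First, can backpropagation use more than one loss? Second, is loss_temporal also used for backpropagation? The answer to the first is yes. The answer to the second depends on two parameters: opts.backpropFuseFrom and opts.train.removeFuseFrom.

If opts.backpropFuseFrom is true, loss_temporal is also added to backpropagation, so the network has two backpropagation paths; debugging and the comments in the source confirm this. If opts.train.removeFuseFrom is true, the entire second half of the temporal network is deleted (shown in a figure in the original post), and loss_temporal is deleted with it.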
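The selection rule can be sketched without a network by running the same condition over the two loss names. This is a simplification: the script tests layer names and stores layer outputs, whereas the sketch uses the names loss_spatial and loss_temporal from the text for both, and fuseInto = 'spatial' follows the description above.

```matlab
lossNames = {'loss_spatial', 'loss_temporal'};   % names as used in the text
fuseInto  = 'spatial';                           % stream that receives the fusion

flags   = [true false];                          % the two settings of backpropFuseFrom
results = cell(1, numel(flags));
for k = 1:numel(flags)
    backpropFuseFrom = flags(k);
    derOutputs = {};
    for l = 1:numel(lossNames)
        % Keep a loss if both streams are trained or if it belongs to fuseInto.
        if backpropFuseFrom || ~isempty(strfind(lossNames{l}, fuseInto))
            derOutputs = [derOutputs, lossNames(l), {1}];   % name followed by its weight
        end
    end
    results{k} = derOutputs;
    fprintf('backpropFuseFrom = %d -> %d loss(es) backpropagated\n', ...
        backpropFuseFrom, numel(derOutputs) / 2);
end
```

derOutputs is a flat cell array of name/weight pairs, so listing two losses with weight 1 amounts to backpropagating their sum.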

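Concretely, how are the two losses backpropagated together: one after the other, or interleaved? The backward order can be obtained with the following command: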

net.layers(fliplr(net.getLayerExecutionOrder())).name

    err1_temporal        err1__spatial
    loss39_temporal      loss39_spatial
    prediction_temporal  prediction_spatial
    layer37_temporal     layer37_spatial
    relu7_temporal       relu7_spatial
    fc7_temporal         fc7_spatial
    layer34_temporal     layer34_spatial
    relu6_temporal       relu6_spatial
    fc6_temporal         fc6_spatial
    ------------------------------
    pool3D5_temporal     pool3D5_spatial
    relu3D5
    conv53D
    relu5_3_concat
    ------------------------------
    relu5_3_temporal     relu5_3_spatial
    conv5_3_temporal     conv5_3_spatial
    relu5_2_temporal     relu5_2_spatial
    conv5_2_temporal     conv5_2_spatial
    relu5_1_temporal     relu5_1_spatial
    conv5_1_temporal     conv5_1_spatial
    pool4_temporal       pool4_spatial
    relu4_3_temporal     relu4_3_spatial
    conv4_3_temporal     conv4_3_spatial
    relu4_2_temporal     relu4_2_spatial
    conv4_2_temporal     conv4_2_spatial
    relu4_1_temporal     relu4_1_spatial
    conv4_1_temporal     conv4_1_spatial
    pool3_temporal       pool3_spatial
    relu3_3_temporal     relu3_3_spatial
    conv3_3_temporal     conv3_3_spatial
    relu3_2_temporal     relu3_2_spatial
    conv3_2_temporal     conv3_2_spatial
    relu3_1_temporal     relu3_1_spatial
    conv3_1_temporal     conv3_1_spatial
    pool2_temporal       pool2_spatial
    relu2_2_temporal     relu2_2_spatial
    conv2_2_temporal     conv2_2_spatial
    relu2_1_temporal     relu2_1_spatial
    conv2_1_temporal     conv2_1_spatial
    pool1_temporal       pool1_spatial
    relu1_2_temporal     relu1_2_spatial
    conv1_2_temporal     conv1_2_spatial
    relu1_1_temporal     relu1_1_spatial
    conv1_1_temporal     conv1_1_spatial

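As the listing shows, the layers of the two networks alternate in the execution order both before and after the fusion layers.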
