First of all, thanks to the two authors whose articles explain this paper very clearly (one of them on Zhihu); the links are:
RCNN: the pioneering work that brought CNNs into object detection
A detailed explanation of the R-CNN paper
So I won't write it all up in that much detail here; I'll just record some notes I made while reading the paper.
The overall test-time pipeline is: first use the Selective Search method to propose about 2000 regions where objects may exist, then feed each one through the CNN to extract features, yielding a 4096-dimensional feature vector per ROI; these vectors are then fed into 20 SVM classifiers (binary classifiers) to obtain each ROI's class; finally, for each class, the high-scoring ROIs that survive NMS are sent to the bounding-box regression stage, producing well-refined localizations.
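To make these steps concrete, here is a minimal Python sketch of that test-time flow. The helpers `selective_search`, `warp_region`, `cnn_features`, `nms`, and the per-class `svms` / `bbox_regressors` are hypothetical stand-ins, not the paper's actual code:

```python
import numpy as np

def rcnn_inference(image, selective_search, warp_region, cnn_features,
                   svms, bbox_regressors, nms, score_thresh=0.0):
    proposals = selective_search(image)          # ~2000 class-agnostic boxes
    feats = np.stack([cnn_features(warp_region(image, box))
                      for box in proposals])     # (R, 4096) feature matrix
    detections = []
    for cls, svm in svms.items():                # one binary SVM per class
        scores = svm.decision_function(feats)    # (R,) per-class scores
        for i in nms(proposals, scores, iou_thresh=0.3):
            if scores[i] > score_thresh:         # keep confident, NMS-surviving ROIs
                box = bbox_regressors[cls].refine(proposals[i], feats[i])
                detections.append((cls, box, float(scores[i])))
    return detections
```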
Note: this pioneering work may be the first to bring in the idea of transfer learning. Because classification datasets are much larger than detection datasets, the feature-extraction network is first trained on the ImageNet ILSVRC 2012 classification dataset and then fine-tuned on the VOC detection dataset; the fine-tuning learning rate is typically ten times smaller (to preserve the features already learned).
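A hedged PyTorch-style sketch of that setup, using torchvision's AlexNet as a stand-in for the paper's Caffe model; the 0.001 learning rate (1/10 of the 0.01 pretraining rate) follows the paper, while the layer names and weight flags are torchvision's:

```python
import torch
import torchvision

# Start from weights pretrained on the ILSVRC 2012 classification task.
model = torchvision.models.alexnet(weights="IMAGENET1K_V1")
# Swap the 1000-way ImageNet head for a 21-way head (20 VOC classes + background).
model.classifier[6] = torch.nn.Linear(4096, 21)

# Fine-tune with a learning rate ~10x smaller than in pretraining,
# so the features learned on ImageNet are not wiped out.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```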
One more point the two blog posts above do not make clear: the 4096-dimensional feature vectors used as positive samples for SVM classification come from different sources at training and test time. At training time they are obtained by running each ground-truth (GT) box through the feature-extraction network; at test time they come from the ~2000 ROIs produced by Selective Search. Since an SVM needs relatively few training samples compared to a CNN, training on the GT boxes gives the SVMs higher classification accuracy. At test time there is no annotation available, so the SVMs simply classify the boxes produced by the SS method; those then go through non-maximum suppression and are finally sent to the bounding-box regression stage.
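For reference, here is a standard greedy non-maximum suppression routine (a common formulation, not necessarily the paper's exact implementation): keep the highest-scoring box, suppress everything that overlaps it beyond a threshold, and repeat.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]           # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the current top box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep
```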
Instead, we solve the CNN localization problem by operating within the “recognition using regions” paradigm [21], which has been successful for both object detection [39] and semantic segmentation [5]. At test time, our method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. We use a simple technique (affine image warping) to compute a fixed-size CNN input from each region proposal, regardless of the region’s shape. Figure 1 presents an overview of our method and highlights some of our results. Since our system combines region proposals with CNNs, we dub the method R-CNN: Regions with CNN features.
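A minimal sketch of that warping step, assuming OpenCV: crop the proposal and anisotropically resize it to the fixed CNN input size regardless of aspect ratio. The p = 16 pixels of context padding is the value the paper reports; `warp_region` is the same hypothetical helper used in the pipeline sketch above.

```python
import cv2

def warp_region(image, box, size=227, pad=16):
    """Crop `box` (x1, y1, x2, y2) with context padding and warp to size x size."""
    x1, y1, x2, y2 = box
    h, w = image.shape[:2]
    x1, y1 = max(0, x1 - pad), max(0, y1 - pad)   # expand with context,
    x2, y2 = min(w, x2 + pad), min(h, y2 + pad)   # clipped to the image bounds
    crop = image[y1:y2, x1:x2]
    return cv2.resize(crop, (size, size))         # anisotropic scaling to a square
```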
The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge? We answer this question by bridging the gap between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features. To achieve this result, we focused on two problems: localizing objects with a deep network and training a high-capacity model with only a small quantity of annotated detection data.
The debated question: to what extent do AlexNet's classification results on ImageNet generalize to object detection results?
In this paper, the authors use transfer learning to train a high-capacity network model despite having little detection training data.
Note: some thoughts on network model capacity:
Our lab has recently been holding weekly paper-sharing sessions, and after listening to a few I gained a new view of CNN capacity. For example: in object detection networks, introducing anchors effectively introduces a space of all regions where a target could possibly exist; in Siamese tracking networks, the score map introduces a space of all positions the target could occupy in the current frame; in binocular disparity networks, the cost volume introduces a space of all possible disparities between the two views. In short, a CNN uses big data and its large parameter capacity to build, for each training objective, a search space of candidate solutions, and then relies on the subsequent network structure and a well-designed loss function to find the optimal or a near-optimal solution.
Unlike image classification, detection requires localizing (likely many) objects within an image. One approach frames localization as a regression problem. However, work from Szegedy et al. [38], concurrent with our own, indicates that this strategy may not fare well in practice (they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method). An alternative is to build a sliding-window detector. CNNs have been used in this way for at least two decades, typically on constrained object categories, such as faces [32, 40] and pedestrians [35]. In order to maintain high spatial resolution, these CNNs typically only have two convolutional and pooling layers. We also considered adopting a sliding-window approach. However, units high up in our network, which has five convolutional layers, have very large receptive fields (195 × 195 pixels) and strides (32×32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge. (Five conv layers, a 32-pixel stride, and a large receptive field: is that why localization is hard in the sliding-window approach?)
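The 195 × 195 / 32 × 32 figures can be reproduced with a short receptive-field calculation. The (kernel, stride) list below is an AlexNet-style conv/pool configuration used for illustration:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns (receptive field, total stride)."""
    rf, total_stride = 1, 1
    for k, s in layers:
        rf += (k - 1) * total_stride   # each layer widens the field by (k-1)*stride-so-far
        total_stride *= s
    return rf, total_stride

layers = [(11, 4), (3, 2),                 # conv1 + pool1
          (5, 1), (3, 2),                  # conv2 + pool2
          (3, 1), (3, 1), (3, 1), (3, 2)]  # conv3-5 + pool5
print(receptive_field(layers))             # -> (195, 32), matching the paper's numbers
```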
Two properties make detection efficient. First, all CNN parameters are shared across all categories. Second, the feature vectors computed by the CNN are low-dimensional when compared to other common approaches, such as spatial pyramids with bag-of-visual-word encodings. The features used in the UVA detection system [39], for example, are two orders of magnitude larger than ours (360k vs. 4k-dimensional). The result of such sharing is that the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) is amortized over all classes. The only class-specific computations are dot products between features and SVM weights and non-maximum suppression. In practice, all dot products for an image are batched into a single matrix-matrix product. The feature matrix is typically 2000×4096 and the SVM weight matrix is 4096×N, where N is the number of classes.
The author's point: although the ROIs do not share the CNN feature computation with each other, all categories share the resulting features; classification into categories happens only at the SVM stage.
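A tiny NumPy sketch of that batched scoring step (random arrays stand in for real features and learned SVM weights):

```python
import numpy as np

R, D, N = 2000, 4096, 20       # proposals, feature dim, number of classes
feats = np.random.randn(R, D)  # one 4096-d CNN feature vector per proposal
W = np.random.randn(D, N)      # N per-class SVM weight vectors, stacked as columns
b = np.random.randn(N)         # per-class SVM biases

scores = feats @ W + b         # (2000, N): all class scores in one matrix product
```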
Our hypothesis is that this difference in how positives and negatives are defined is not fundamentally important and arises from the fact that fine-tuning data is limited. Our current scheme introduces many “jittered” examples (those proposals with overlap between 0.5 and 1, but not ground truth), which expands the number of positive examples by approximately 30x. We conjecture that this large set is needed when fine-tuning the entire network to avoid overfitting. However, we also note that using these jittered examples is likely suboptimal because the network is not being fine-tuned for precise localization.
This leads to the second issue: Why, after fine-tuning, train SVMs at all? It would be cleaner to simply apply the last layer of the fine-tuned network, which is a 21-way softmax regression classifier, as the object detector. We tried this and found that performance on VOC 2007 dropped from 54.2% to 50.9% mAP. This performance drop likely arises from a combination of several factors including that the definition of positive examples used in fine-tuning does not emphasize precise localization and the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training.
The main reason: CNN fine-tuning and SVM training define positive and negative samples differently. The CNN's definition is loose while the SVM's is strict, so using the softmax outputs directly hurts localization precision and therefore mAP; in addition, the softmax classifier was trained on randomly sampled negatives rather than the "hard negatives" used for SVM training.
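For concreteness, here is a sketch of the two labeling schemes as the paper describes them: fine-tuning counts any proposal with IoU >= 0.5 against a ground-truth box as a positive, while the SVMs use only the ground-truth boxes themselves as positives and proposals with IoU < 0.3 as negatives (everything in between is ignored).

```python
def cnn_finetune_label(iou):
    # Loose definition used for CNN fine-tuning: many "jittered" positives.
    return "positive" if iou >= 0.5 else "negative"

def svm_label(iou, is_ground_truth):
    # Strict definition used for SVM training: GT-only positives,
    # clear negatives below IoU 0.3, and a gray zone that is ignored.
    if is_ground_truth:
        return "positive"
    return "negative" if iou < 0.3 else "ignored"
```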
As can be seen, fc6 and fc7 only become useful after fine-tuning; without fine-tuning, the CNN just produces a feature vector much like HOG.
We start by looking at results from the CNN without fine-tuning on PASCAL, i.e. all CNN parameters were pre-trained on ILSVRC 2012 only. Analyzing performance layer-by-layer (Table 2 rows 1-3) reveals that features from fc7 generalize worse than features from fc6. This means that 29%, or about 16.8 million, of the CNN’s parameters can be removed without degrading mAP. More surprising is that removing both fc7 and fc6 produces quite good results even though pool5 features are computed using only 6% of the CNN’s parameters. Much of the CNN’s representational power comes from its convolutional layers, rather than from the much larger densely connected layers. This finding suggests potential utility in computing a dense feature map, in the sense of HOG, of an arbitrary-sized image by using only the convolutional layers of the CNN. This representation would enable experimentation with sliding-window detectors, including DPM, on top of pool5 features.
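A hedged sketch of that conv-layers-only idea, assuming torchvision's AlexNet (whose `features` module bundles the convolutional stack; layer naming differs from the paper's Caffe model): a dense, HOG-like feature map can be computed for an arbitrary-sized image without fc6/fc7.

```python
import torch
import torchvision

model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()
image = torch.randn(1, 3, 451, 600)      # any input comfortably larger than 227x227
with torch.no_grad():
    conv_maps = model.features(image)    # conv5/pool5-level dense feature map
print(conv_maps.shape)                   # torch.Size([1, 256, 13, 17]) for this size
```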
There is an interesting relationship between R-CNN and OverFeat: OverFeat can be seen (roughly) as a special case of R-CNN. If one were to replace selective search region proposals with a multi-scale pyramid of regular square regions and change the per-class bounding-box regressors to a single bounding-box regressor, then the systems would be very similar (modulo some potentially significant differences in how they are trained: CNN detection fine-tuning, using SVMs, etc.). It is worth noting that OverFeat has a significant speed advantage over R-CNN: it is about 9x faster, based on a figure of 2 seconds per image quoted from [34]. This speed comes from the fact that OverFeat’s sliding windows (i.e., region proposals) are not warped at the image level and therefore computation can be easily shared between overlapping windows. Sharing is implemented by running the entire network in a convolutional fashion over arbitrary-sized inputs. Speeding up R-CNN should be possible in a variety of ways and remains as future work.