Question

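I would like to discuss feature extraction using a Caffe model called GoogLeNet. I am referring to the paper "End-to-end people detection in crowded scenes". Anyone familiar with Caffe should be able to help with my questions.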

The paper comes with its own Python library. I have gone through the library, but I cannot work out some of the points mentioned in the paper.

The input image is passed through GoogLeNet up to the inception_5b/output layer.

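The output is then formed into a multidimensional array of size 15x20x1024, so each 1024-dimensional vector represents a bounding box centered on a 64x64 region. Since the regions overlap by 50%, a 640x480 image gives a 15x20 grid, and each cell holds a vector of length 1024 along the third dimension.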

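My questions are:

(1) How is the 15x20x1024 array output obtained?

(2) How does this 1x1x1024 data represent a 64x64 region in the image? The source code has a description that reads: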

"""Takes the output from the decapitated googlenet and transforms the output
    from a NxCxWxH to (NxWxH)xCx1x1 that is used as input for the lstm layers.
    N = batch size, C = channels, W = grid width, H = grid height."""

The conversion is implemented by the following Python function:

def generate_intermediate_layers(net):
    """Takes the output from the decapitated googlenet and transforms the output
    from a NxCxWxH to (NxWxH)xCx1x1 that is used as input for the lstm layers.
    N = batch size, C = channels, W = grid width, H = grid height."""

    net.f(Convolution("post_fc7_conv", bottoms=["inception_5b/output"],
                      param_lr_mults=[1., 2.], param_decay_mults=[0., 0.],
                      num_output=1024, kernel_dim=(1, 1),
                      weight_filler=Filler("gaussian", 0.005),
                      bias_filler=Filler("constant", 0.)))
    net.f(Power("lstm_fc7_conv", scale=0.01, bottoms=["post_fc7_conv"]))
    net.f(Transpose("lstm_input", bottoms=["lstm_fc7_conv"]))

I cannot work out this part, i.e. how each 1x1x1024 vector represents the size of a bounding-box rectangle.

Answer 1

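Since you are looking at a 1x1 cell very deep in the network, its effective receptive field is quite large, and it may be (and probably is) 64x64 pixels in the original image.
That is, each feature in "inception_5b/output" is affected by 64x64 pixels of the input image.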
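
To make the shapes concrete, here is a minimal NumPy sketch of the arithmetic. It assumes one grid cell per 32-pixel step (64x64 regions with 50% overlap), which is consistent with GoogLeNet halving the resolution five times before inception_5b/output, as far as I know. The names grid_shape and to_lstm_input are made up for illustration. to_lstm_input only mimics the shape change described in the docstring; it is not the library's Transpose layer, and the cell ordering it produces is an assumption.

```python
import numpy as np


def grid_shape(image_h, image_w, region=64, overlap=0.5):
    """Return (rows, cols) of the cell grid for regions placed with the given overlap."""
    stride = int(region * (1 - overlap))  # 64 * 0.5 = 32 pixels between neighbouring cells
    return image_h // stride, image_w // stride


def to_lstm_input(features):
    """Reshape an (N, C, H, W) array to (N*H*W, C, 1, 1).

    Hypothetical stand-in for the shape change in the docstring,
    not the library's actual Transpose layer.
    """
    n, c, h, w = features.shape
    # Move channels last so each cell's C-vector stays together, then flatten the cells.
    return features.transpose(0, 2, 3, 1).reshape(n * h * w, c, 1, 1)


rows, cols = grid_shape(480, 640)
# Random data standing in for the real inception_5b/output blob.
features = np.random.rand(1, 1024, rows, cols).astype(np.float32)
cells = to_lstm_input(features)

print(rows, cols)   # 15 20
print(cells.shape)  # (300, 1024, 1, 1)
```

So the 15x20x1024 array is just the convolutional feature map itself: no extra cropping of 64x64 patches is needed, and each of the 300 rows handed to the LSTM is the 1024-vector of one grid cell.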
