Why is the separable convolution implemented in TensorFlow slower than a normal convolution?

Asked: 2019-01-18 04:19:59

Tags: python-3.x performance tensorflow deep-learning pytorch

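I have tested the speed of separable_conv2d and the normal conv2d implemented in TF. It seems that only depthwise_conv2d on its own is faster than the normal conv2d, while the performance of dw_conv2d is clearly poor.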

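The separable_conv2d described in MobileNet needs roughly 1/9 of the FLOPs of a normal convolution when kernel_size=3. Taking the Memory Access Cost into account, I do not expect the separable one to be a full 9 times faster than the normal one, but in my experiments the separable one is actually much slower.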
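The FLOPs argument can be made concrete with a short calculation. The sketch below counts multiply-accumulates for a stride-1, SAME-padded layer. The function names are illustrative, and setting the number of output filters equal to the number of input channels is an assumption, since the question does not state the value of nfilter.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k convolution (stride 1, SAME padding)."""
    return h * w * c_in * c_out * k * k


def separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k convolution (depth multiplier 1) plus a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


def separable_ratio(c_out, k):
    """Separable cost divided by standard cost; algebraically 1 / c_out + 1 / k**2."""
    return 1.0 / c_out + 1.0 / (k * k)


if __name__ == "__main__":
    # Assumption: nfilter == channels, image size 512 as in the question.
    for channels in (32, 64, 128):
        normal = conv_macs(512, 512, channels, channels, 3)
        sep = separable_macs(512, 512, channels, channels, 3)
        print(channels, normal, sep, round(sep / normal, 4))
```

For kernel_size=3 the ratio is 1/c_out + 1/9, so it only approaches 1/9 as the number of output channels grows.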

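I modelled my experiment on the question "separable_conv2d is too slow". In that experiment, separable_conv2d seems to be faster than the normal one when depth_multiply = 1, but I get a different result when I implement it with tf.nn as follows: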

import os

# Set before importing TensorFlow so the log level takes effect.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import tensorflow as tf  # TF 1.x API (tf.variable_scope, tf.get_variable)

IMAGE_SIZE = 512
REPEAT = 100
KERNEL_SIZE = 3
data_format = 'NCHW'
# CHANNELS_BATCH_SIZE = 2048  # channels * batch_size


def normal_layers(inputs, nfilter, name=''):
    """Standard KERNEL_SIZE x KERNEL_SIZE convolution on an NCHW input."""
    with tf.variable_scope(name, reuse=tf.AUTO_REUSE):
        shape = inputs.shape.as_list()
        in_channels = shape[1]  # channel axis in NCHW
        kernel = tf.get_variable(
            initializer=tf.initializers.random_normal(),
            shape=[KERNEL_SIZE, KERNEL_SIZE, in_channels, nfilter],
            name='weight')
        conv = tf.nn.conv2d(input=inputs,
                            filter=kernel,
                            strides=[1, 1, 1, 1],
                            padding='SAME',
                            data_format=data_format,
                            name='conv')
    return conv


def sep_layers(inputs, nfilter, name=''):
    """Depthwise convolution followed by a 1x1 pointwise convolution."""
    with tf.variable_scope(name, reuse=tf.AUTO_REUSE):
        shape = inputs.shape.as_list()
        in_channels = shape[1]  # channel axis in NCHW
        # Depthwise filter: one KERNEL_SIZE x KERNEL_SIZE kernel per input
        # channel (channel multiplier 1).
        dw_filter = tf.get_variable(
            initializer=tf.initializers.random_normal(),
            shape=[KERNEL_SIZE, KERNEL_SIZE, in_channels, 1],
            name='dw_weight')
        # Pointwise filter: 1x1 convolution that mixes the channels.
        pw_filter = tf.get_variable(
            initializer=tf.initializers.random_normal(),
            shape=[1, 1, in_channels, nfilter],
            name='pw_weight')
        conv = tf.nn.depthwise_conv2d_native(input=inputs,
                                             filter=dw_filter,
                                             strides=[1, 1, 1, 1],
                                             padding='SAME',
                                             data_format=data_format)
        conv = tf.nn.conv2d(input=conv,
                            filter=pw_filter,
                            strides=[1, 1, 1, 1],
                            padding='SAME',
                            data_format=data_format)
    return conv
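The timing loop itself is not shown in the question. A minimal sketch of how REPEAT runs of one layer might be timed is given below; the helper name time_repeated, the warm-up count, and the stand-in workload are all assumptions made for illustration, not the code that produced the numbers reported here.

```python
import time


def time_repeated(run_once, repeat=100, warmup=5):
    """Return the wall-clock seconds spent on `repeat` calls of `run_once`.

    `warmup` untimed calls come first so that one-off costs (graph setup,
    autotuning, memory allocation) are not counted.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(repeat):
        run_once()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload. In the TF 1.x setting of the question this would be
    # something like `lambda: sess.run(conv)` (hypothetical names).
    duration = time_repeated(lambda: sum(range(10000)), repeat=100)
    print("100 runs took %.4fs" % duration)
```

When a tensor is fetched through sess.run, its value is returned as a NumPy array, so copying a large output back to host memory is included in whatever such a loop measures.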

Each layer is run 100 times.
Unlike the linked question, I set batch_size to a constant 10,
with channels in [32, 64, 128].
The input has shape [batch_size, channels, img_size, img_size], and the durations are as follows:

Channels: 32
Normal Conv 0.7769527435302734s, Sep Conv 1.4197885990142822s
Channels: 64
Normal Conv 0.8963277339935303s, Sep Conv 1.5703468322753906s
Channels: 128
Normal Conv 0.9741833209991455s, Sep Conv 1.665834665298462s 

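With batch_size constant and only the number of channels changing, the time cost of both the normal and the separable convolution increases gradually.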

When batch_size * channels is held constant instead,
the input shape is [CHANNELS_BATCH_SIZE // channels, channels, imgsize, imgsize], and the durations are:

Channels: 32
Normal Conv 0.871959924697876s, Sep Conv 1.569300651550293s
Channels: 64
Normal Conv 0.909860372543335s, Sep Conv 1.604109525680542s
Channels: 128
Normal Conv 0.9196009635925293s, Sep Conv 1.6144189834594727s
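For reference, the input shapes behind this second set of timings can be enumerated as in the sketch below. It assumes CHANNELS_BATCH_SIZE = 2048, the value that appears only in the commented-out constant of the code; the helper name input_shape is illustrative.

```python
IMAGE_SIZE = 512
CHANNELS_BATCH_SIZE = 2048  # assumed: taken from the commented-out constant


def input_shape(channels, total=CHANNELS_BATCH_SIZE, image_size=IMAGE_SIZE):
    """NCHW input shape with batch_size * channels held at `total`."""
    batch_size = total // channels
    return [batch_size, channels, image_size, image_size]


if __name__ == "__main__":
    for channels in (32, 64, 128):
        print(channels, input_shape(channels))
```

Under that assumption every input holds the same number of elements, so only the split between batch and channel dimensions changes from run to run.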

What confuses me is that this result differs from the one in the link above: the time cost of sep_conv2d does not change noticeably.
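The two sets of timings can be compared with a few lines of arithmetic. The sketch below copies the reported durations and computes the separable-to-normal slowdown and how much each column varies across channel counts; the dictionary and function names are illustrative.

```python
# Durations in seconds, copied from the two result listings above:
# channels -> (normal conv, separable conv).
CONSTANT_BATCH = {
    32: (0.7769527435302734, 1.4197885990142822),
    64: (0.8963277339935303, 1.5703468322753906),
    128: (0.9741833209991455, 1.665834665298462),
}
CONSTANT_PRODUCT = {
    32: (0.871959924697876, 1.569300651550293),
    64: (0.909860372543335, 1.604109525680542),
    128: (0.9196009635925293, 1.6144189834594727),
}


def slowdown(timings):
    """Map channels -> separable time / normal time."""
    return {c: sep / normal for c, (normal, sep) in timings.items()}


def relative_spread(timings, index):
    """(max - min) / min of one column (0 = normal, 1 = separable)."""
    values = [pair[index] for pair in timings.values()]
    return (max(values) - min(values)) / min(values)


if __name__ == "__main__":
    for label, timings in (("constant batch", CONSTANT_BATCH),
                           ("constant batch * channels", CONSTANT_PRODUCT)):
        print(label, slowdown(timings),
              relative_spread(timings, 0), relative_spread(timings, 1))
```

In both experiments the separable version comes out roughly 1.7 to 1.8 times slower than the normal one, and with batch_size * channels held constant the separable timings vary by only a few percent.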

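My questions are:

  1. What is the difference between the experiment in the link above and my own?
  2. I am new to this, so is there something wrong with how my code implements separable_conv2d?
  3. In TF or PyTorch, how can separable_conv2d be implemented so that it is faster than the normal one?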

Any help would be appreciated. Thanks in advance.

0 answers:

No answers yet