Based on the convolution theorem of the Fourier transform, convolution in the spatial domain is equivalent to pointwise multiplication in the Fourier domain (and vice versa). I implemented a torch.nn.Conv2d that "operates" in the Fourier domain by performing a pointwise multiplication in PyTorch instead of a convolution (after transforming the kernel to the size of the input), as described here: https://arxiv.org/pdf/1312.5851.pdf
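For context, here is a simplified sketch of the general approach (not my exact layer; torch.fft.rfft2/irfft2 and the name fft_conv2d are used only for illustration, and this produces a circular convolution rather than Conv2d's "valid" cross-correlation):
import torch

def fft_conv2d(x, weight):
    # x: [N, C_in, H, W]; weight: [C_out, C_in, kH, kW], zero-padded to H x W by rfft2
    _, _, H, W = x.shape
    Xf = torch.fft.rfft2(x, s=(H, W))       # [N, C_in, H, W//2 + 1]
    Kf = torch.fft.rfft2(weight, s=(H, W))  # [C_out, C_in, H, W//2 + 1]
    # Pointwise multiplication in the Fourier domain, then sum over input channels
    Yf = (Xf.unsqueeze(1) * Kf.unsqueeze(0)).sum(dim=2)  # [N, C_out, H, W//2 + 1]
    return torch.fft.irfft2(Yf, s=(H, W))   # [N, C_out, H, W], circular convolution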
I found that it runs poorly, similar to: Keras/Tensorflow - fourier pointwise multiplication implementation of conv2d running 4x slower than spatial convolution
After running several benchmarks, the pointwise multiplication appears to be the main bottleneck of the operation. For the benchmarks I excluded the FFT step in order to isolate the layer's operation (and used pre-saved kernels of the appropriate size).
This is confusing when considering the number of FLOPs required for a 2D convolution (stride = 1) versus an elementwise multiplication:
Kernel_H * Kernel_W * C_in * C_out * H * W
C_in * C_out * H * W
For example, given H = 32, W = 60, C_in = 64, C_out = 256:
16 * 16 * 32 * 60 * 64 * 256 = 8053 MFLOPs
64 * 256 * 32 * 60 = 31.46 MFLOPs
Given the huge difference in FLOPs, I would expect the 2D convolution to take considerably longer (granted, GPUs are heavily optimized for dot products).
I created a simple script to benchmark the pointwise multiplication of a torch.Tensor against torch.nn.Conv2d, and the elementwise multiplication turned out to be comparable to, or even slower than, the 2D convolution.
Below is an overview of two such benchmark results on the CPU (i9-9900K, with torch.set_num_threads(1)) and on the GPU.
Results - CPU (i9-9900K)
(# Kernel Size = 16)
Benchmark Overview (device = cpu):
Number of test iterations: 100
Number of warm-up iterations: 5
Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
Conv2d(in_ch=256, out_ch=64, kernel_size=16): Conv2d([1, 256, 32, 60])
FLOP Estimation:
Conv2d: 8053.06368 MFlops
Pointwise: 31.45728 MFlops
Benchmark Results (device = cpu)
Pointwise: 16.139 +/- 0.786 ms
Conv2d: 12.947 +/- 0.784 ms
-------------------------
(# Kernel Size = 5)
Benchmark Overview (device = cpu):
Number of test iterations: 100
Number of warm-up iterations: 5
Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
Conv2d(in_ch=256, out_ch=64, kernel_size=5): Conv2d([1, 256, 32, 60])
FLOP Estimation:
Conv2d: 786.432 MFlops
Pointwise: 31.45728 MFlops
Benchmark Results (device = cpu)
Pointwise: 36.085 +/- 3.668 ms
Conv2d: 9.344 +/- 0.952 ms
Results - GPU (RTX Titan)
(# Kernel Size = 16)
Benchmark Overview (device = cuda:1):
Number of test iterations: 1000
Number of warm-up iterations: 5
Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
Conv2d(in_ch=256, out_ch=64, kernel_size=16): Conv2d([1, 256, 32, 60])
FLOP Estimation:
Conv2d: 8053.06368 MFlops
Pointwise: 31.45728 MFlops
Benchmark Results (device = cuda:1)
Pointwise: 0.698 +/- 0.031 ms
Conv2d: 2.916 +/- 0.161 ms
------------------------------------
(# Kernel size = 3)
Benchmark Overview (device = cuda:1):
Number of test iterations: 100
Number of warm-up iterations: 5
Pointwise: [1, 256, 32, 60] * [64, 256, 32, 60]
Conv2d(in_ch=256, out_ch=64, kernel_size=3): Conv2d([1, 256, 32, 60])
FLOP Estimation:
Conv2d: 283.11552 MFlops
Pointwise: 31.45728 MFlops
FreqConv: 62.91456 MFlops
Benchmark Results (device = cuda:1)
Pointwise: 0.681 +/- 0.011 ms
Conv2d: 0.126 +/- 0.034 ms
Changing H, W, or the number of channels does not change the results significantly. However, with smaller kernels the pointwise multiplication becomes dramatically slower relative to the convolution.
Can anyone shed some light on why the pointwise multiplication is so slow when its FLOP count is at least two orders of magnitude smaller, or point out where my reasoning or code might be wrong?
Benchmark implementation
import torch
import numpy as np
from torch import nn
from time import time
torch.set_num_threads(1)
in_ch = 256
out_ch = 64
height = 32
width = 60
kernel_size = 16
warmup = 5
iters = 100
flops_pointwise = out_ch * in_ch * height * width
m_flops_conv = (flops_pointwise * kernel_size ** 2) / 1e6
m_flops_pw = flops_pointwise / 1e6
# Rough estimate for the frequency-domain conv (2x the real-valued pointwise multiply)
m_flops_freq_conv = 2 * m_flops_pw
# Device to run benchmark on, e.g. 'cpu' or 'cuda:X'
device = 'cpu'
print(f'Benchmark Overview (device = {device}):')
print(f'\tNumber of test iterations: {iters}')
print(f'\tNumber of warm-up iterations: {warmup}')
print(f'\tPointwise: [1, {in_ch}, {height}, {width}] * [{out_ch}, {in_ch}, {height}, {width}]')
print(f'\tConv2d(in_ch={in_ch}, out_ch={out_ch}, kernel_size={kernel_size}): Conv2d([1, {in_ch}, {height}, {width}])')
print('\tFLOP Estimation:')
print(f'\t\tConv2d:\t\t {m_flops_conv} MFlops')
print(f'\t\tPointwise:\t {m_flops_pw} MFlops')
print(f'\t\tFreqConv:\t {m_flops_freq_conv} MFlops')
print()
def benchmark(input_gen, operation, warmup=5, iters=1000):
    duration = []
    for i in range(iters + warmup):
        input = input_gen()
        start = time()  # start timer
        with torch.no_grad():
            operation(input)
        # Sync if using cuda
        if device[:4] == 'cuda':
            torch.cuda.synchronize(device)
        end = time()  # end timer
        if i < warmup:
            continue
        duration.append((end - start) * 1e3)  # ms
    return np.array(duration)
def pointwise(input):
    x, y = input
    x * y  # broadcasted elementwise multiply; the result is discarded, we only time the op
# Helper methods to generate new data
# for every iteration inside of the benchmark method
def _gen_pw_input(in_ch, out_ch, height, width):
    x = torch.rand(1, in_ch, height, width).to(device)
    k = torch.randn(out_ch, in_ch, height, width).to(device)
    return x, k
gen_pw_input = lambda : _gen_pw_input(in_ch, out_ch, height, width)
def _gen_conv_input(in_ch, out_ch, height, width):
    x = torch.rand(1, in_ch, height, width).to(device)
    return x
gen_conv_input = lambda : _gen_conv_input(in_ch, out_ch, height, width)
conv2d = nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size).to(device)
pw_res = benchmark(gen_pw_input, pointwise, warmup=warmup, iters=iters)
conv_res = benchmark(gen_conv_input, conv2d, warmup=warmup, iters=iters)
print(f'Benchmark Results (device = {device})')
print('\tPointwise:\t {:.3f} +/- {:.3f} ms'.format(pw_res.mean(), pw_res.std()))
print('\tConv2d:\t\t {:.3f} +/- {:.3f} ms'.format(conv_res.mean(), conv_res.std()))
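As a possible sanity check on the manual timing loop, the same two operations could also be timed with torch.utils.benchmark (a sketch, assuming PyTorch >= 1.8 and reusing in_ch, out_ch, height, width, device, and conv2d from the script above; Timer takes care of warm-up and CUDA synchronization by itself):
from torch.utils import benchmark as tb

x = torch.rand(1, in_ch, height, width, device=device)
k = torch.randn(out_ch, in_ch, height, width, device=device)

t_pw = tb.Timer(stmt='x * k', globals={'x': x, 'k': k})
t_conv = tb.Timer(stmt='with torch.no_grad(): conv2d(x)',
                  globals={'torch': torch, 'conv2d': conv2d, 'x': x})

print(t_pw.blocked_autorange(min_run_time=1))
print(t_conv.blocked_autorange(min_run_time=1))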
Eigen
I also implemented a basic benchmark of the elementwise multiplication in Eigen (C++), and the results were similar to (slightly slower than) what I observed in PyTorch; the BLAS backend used by PyTorch appears to be well optimized.