GPU array multiplication on NumPy arrays using PyCUDA

Date: 2019-06-27 13:56:49

Tags: numpy matrix-multiplication pycuda elementwise-operations n-dimensional

I am trying to implement element-wise multiplication of two numpy arrays by creating equivalent GPU arrays and performing the operation there. However, the resulting execution time is much slower than the original numpy point-wise multiplication. I was hoping to get a good speedup by using the GPU. zz0 is a (64,256,16)-shaped numpy array of type complex128, and xx0 is a (16,151)-shaped numpy array of type float64. Can someone help me figure out what I am doing wrong in my implementation:

import sys
import numpy as np
import matplotlib.pyplot as plt
import pdb
import time

import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda.elementwise import ElementwiseKernel
import pycuda.gpuarray as gpuarray
import pycuda.cumath
import skcuda.linalg as linalg

linalg.init()

# Function for doing a point-wise multiplication using GPU
def calc_Hyp(zz,xx):
    zz_stretch = np.tile(zz, (1,1,1,xx.shape[3]))
    xx_stretch = np.tile(xx, (zz.shape[0],zz.shape[1],1,1))
    zzg = gpuarray.to_gpu(zz_stretch)
    xxg = gpuarray.to_gpu(xx_stretch)
    zz_Hypg = linalg.multiply(zzg,xxg)
    zz_Hyp = zz_Hypg.get()
    return zz_Hyp


zz0 = np.random.uniform(10.0/5000, 20000.0/5000, (64,256,16)).astype('complex128')
xx0 = np.random.uniform(10.0/5000, 20000.0/5000, (16,151)).astype('float64')

xx0_exp = np.exp(-1j*xx0)

t1 = time.time()

#Using GPU for the calculation
zz0_Hyp = calc_Hyp(zz0[:,:,:,None],xx0_exp[None,None,:,:])
#np.save('zz0_Hyp',zz0_Hyp)

t2 = time.time()
print('Time taken with GPU:{}'.format(t2-t1))

#Original calculation
zz0_Hyp_actual = zz0[:,:,:,None]*xx0_exp[None,None,:,:]
#np.save('zz0_Hyp_actual',zz0_Hyp_actual)

t3 = time.time()
print('Time taken without GPU:{}'.format(t3-t2))

1 Answer:

Answer 0 (score: 0)

The first problem is that your timing metrics are not accurate.

Linalg compiles CUDA modules on the fly; you may see code being compiled while it runs. I made some small modifications to your code to reduce the size of the arrays being multiplied, but regardless, after two runs with no other improvements I saw massive gains in performance, for example:

Time taken with GPU:2.5476348400115967
Time taken without GPU:0.16627931594848633

vs

Time taken with GPU:0.8741757869720459
Time taken without GPU:0.15836167335510254
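One way to keep that one-time compilation cost out of the measurement is to warm the function up before timing it and report the best of several runs. A minimal sketch of the pattern (the `timed` helper and its parameters are my own for illustration, not part of the original code):

```python
import time

def timed(fn, *args, warmup=2, reps=5):
    # Untimed warm-up calls absorb one-time costs such as JIT
    # compilation of the CUDA module.
    for _ in range(warmup):
        fn(*args)
    # Report the best of several timed runs to reduce noise.
    best = float("inf")
    for _ in range(reps):
        t0 = time.time()
        result = fn(*args)
        best = min(best, time.time() - t0)
    return result, best
```

It would be called as, e.g., `timed(calc_Hyp, zz0[:,:,:,None], xx0_exp[None,None,:,:])`.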

However, this is still much slower than the CPU version. The second thing I did was to base the timing on where the actual computation happens. You aren't tiling in your numpy version, so don't tile in your cuda version either:

REAL Time taken with GPU:0.6461708545684814

You also copy to and from the GPU and include that in the calculation, and that by itself takes a non-trivial amount of time, so let's remove it:

t1 = time.time()
zz_Hypg = linalg.multiply(zzg,xxg)
t2 = time.time()
...
REAL Time taken with GPU:0.3689603805541992

Wow, that made a huge difference. But we are still slower than the numpy version? Why?

Remember when I said numpy doesn't tile? It doesn't copy memory at all when broadcasting. To get the real speed, you have to:

  • not tile
  • broadcast the dimensions
  • implement it in a kernel.
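The "no copy" claim about numpy broadcasting can be checked directly: `np.tile` materializes the repeated data, while a broadcast view just sets the repeated axis's stride to 0. A quick demonstration (using smaller stand-in shapes than the question's, since tiling the real arrays would allocate hundreds of megabytes):

```python
import numpy as np

# Small stand-in for zz0[:,:,:,None]; same idea as the question's shapes.
zz = np.ones((4, 8, 16, 1), dtype=np.complex128)

# np.tile materializes every repeated element: 151x the memory.
tiled = np.tile(zz, (1, 1, 1, 151))
print(tiled.nbytes // zz.nbytes)   # 151

# Broadcasting is a zero-copy view: the repeated axis gets stride 0,
# so the same memory is revisited instead of duplicated.
view = np.broadcast_to(zz, (4, 8, 16, 151))
print(view.strides[-1])            # 0
```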

Pycuda provides the utilities for implementing a kernel, but its GPU arrays do not provide broadcasting. Essentially what you would have to do is this (disclaimer: I haven't tested it, there are probably bugs, this is just to demonstrate approximately what the kernel should look like):

#include <pycuda-complex.hpp>
//KERNEL CODE
constexpr unsigned work_tile_dim = 32;
//instruction level parallelism factor: how much extra work to do per thread. May be
//changed, but it affects the launch dimensions; the thread block size should be
//(work_tile_dim, work_tile_dim / ilp_factor).
constexpr unsigned ilp_factor = 4;
//assuming c order:
//    x axis contiguous in out,
//    y axis contiguous in zz,
//    x axis contiguous in xx
//using restrict because we know that all pointers will refer to different parts of memory.
__global__
void element_wise_multiplication(
    const pycuda::complex<double>* __restrict__ array_zz,
    const pycuda::complex<double>* __restrict__ array_xx,
    pycuda::complex<double>* __restrict__ out_array,
    unsigned array_zz_w,    /*size of the w dimension of zz*/
    unsigned array_zz_z,    /*size of the z dimension of zz*/
    unsigned array_zz_xx_y, /*size of the y dimension, shared by zz and xx*/
    unsigned array_xx_x){   /*size of the x dimension of xx*/

    //the z dimension of a grid often has restrictions on size that can be fairly small,
    //and can sometimes cause performance issues on older cards, so we derive the z and w
    //indices from blockIdx.z and keep x and y in blockIdx.x and blockIdx.y instead.
    unsigned x_idx = blockIdx.x * work_tile_dim + threadIdx.x;
    //each thread handles ilp_factor consecutive y values, so threads are spaced
    //ilp_factor apart in y.
    unsigned y_idx = blockIdx.y * work_tile_dim + threadIdx.y * ilp_factor;
    //blockIdx.z packs both z and w and should not overshoot. These two aren't used below;
    //they are shown only for the sake of how to recover those dimensions.
    unsigned z_idx = blockIdx.z % array_zz_z;
    unsigned w_idx = blockIdx.z / array_zz_z;
    (void)z_idx; (void)w_idx;
    //we already know this part of the indexing calculation.
    unsigned out_idx_zw = blockIdx.z * (array_zz_xx_y * array_xx_x);
    //since the zz input array is actually 3D, this is a different calculation.
    unsigned array_zz_zw = blockIdx.z * array_zz_xx_y;
    //ensure that if our launch dimensions don't exactly match the input size, we don't
    //accidentally access out-of-bounds memory. While branching can be bad, it isn't here,
    //because 99.999% of the time no divergence occurs and the instruction pointer
    //will be the same per warp, meaning virtually zero cost.
    if(x_idx < array_xx_x){
        //moving over the y axis so memory accesses coalesce in the x dimension per warp.
        for(int i = 0; i < ilp_factor; ++i){
            //need to also check y; these checks are virtually cost-free
            //because memory access dominates the time in such simple calculations,
            //and the arithmetic is hidden by overlapping execution.
            if((y_idx + i) < array_zz_xx_y){
                //splitting up the calculation for simplicity's sake
                unsigned out_array_idx = out_idx_zw + (y_idx + i) * array_xx_x + x_idx;
                unsigned array_zz_idx = array_zz_zw + (y_idx + i);
                unsigned array_xx_idx = (y_idx + i) * array_xx_x + x_idx;
                //actual final output.
                out_array[out_array_idx] = array_zz[array_zz_idx] * array_xx[array_xx_idx];
            }
        }
    }
}

You would have to make the launch dimensions something like:

thread_dim = (work_tile_dim, work_tile_dim // ilp_factor)  # (32, 8)
y_dim = xx0.shape[0]
x_dim = xx0.shape[1]
wz_dim = zz0.shape[0] * zz0.shape[1]
# ceiling division, so the grid still covers sizes that are not
# an exact multiple of the tile; the kernel's bounds checks handle the rest
block_dim = ((x_dim + work_tile_dim - 1) // work_tile_dim,
             (y_dim + work_tile_dim - 1) // work_tile_dim,
             wz_dim)
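With the question's actual shapes (zz0 is (64,256,16), xx0 is (16,151)) that arithmetic works out as follows. This is a host-side sketch of the launch-dimension math only, untested against the kernel above; the `ceil_div` helper is my own name:

```python
work_tile_dim = 32
ilp_factor = 4

zz0_shape = (64, 256, 16)   # (w, z, y) from the question
xx0_shape = (16, 151)       # (y, x) from the question

def ceil_div(a, b):
    # Round up so partial tiles at the edges still get a block;
    # the kernel's bounds checks discard the out-of-range threads.
    return (a + b - 1) // b

thread_dim = (work_tile_dim, work_tile_dim // ilp_factor)   # (32, 8)
x_dim = xx0_shape[1]                                        # 151
y_dim = xx0_shape[0]                                        # 16
wz_dim = zz0_shape[0] * zz0_shape[1]                        # 16384

block_dim = (ceil_div(x_dim, work_tile_dim),
             ceil_div(y_dim, work_tile_dim),
             wz_dim)
print(thread_dim, block_dim)   # (32, 8) (5, 1, 16384)
```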

There are also some further optimizations you can take advantage of:

  • Store your global memory accesses in a work tile in shared memory inside the kernel. This ensures that your accesses to zz0's "y" (but really x) dimension coalesce when they are put into shared memory, increasing performance, and they are then read from shared memory (where coalescing doesn't matter, but bank conflicts do). See here for how to deal with that kind of bank conflict.

  • Instead of computing Euler's formula on the host and expanding a double into a complex double, expand it inside the kernel itself: use sincos(-x, &out_sin, &out_cos) to achieve the same result with far less memory bandwidth used (see here).
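The identity that the second bullet relies on is just Euler's formula: exp(-ix) = cos(x) - i·sin(x), which is exactly what CUDA's sincos(-x, &s, &c) provides the pieces for. A host-side numpy sanity check of the equivalence (this is not the CUDA intrinsic itself, just the math it would compute):

```python
import numpy as np

# Same value range and shape as xx0 in the question.
x = np.random.uniform(10.0 / 5000, 20000.0 / 5000, (16, 151))

# What the question precomputes on the host:
via_exp = np.exp(-1j * x)

# What the kernel could compute instead from sin/cos of -x:
s, c = np.sin(-x), np.cos(-x)
via_sincos = c + 1j * s

assert np.allclose(via_exp, via_sincos)
```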

Note, however, that even doing all this will probably not give you the performance you want (though it will likely still be faster), unless you are on a high-end GPU with full double-precision units, which most GPUs don't have. Double-precision floating-point units take up a lot of space, and since GPUs are built for graphics, they don't have much use for double precision; most GPUs run doubles at 1/8 to 1/32 of their float throughput. If you want more precision than single-precision float but still want to exploit the float hardware, you can use the techniques used here to achieve that on the GPU, getting you closer to 1/2 to 1/3 of float throughput.
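The "more precision from float hardware" trick mentioned above represents one double as an unevaluated sum of two floats (often called double-float or float-float arithmetic). A minimal illustration of the split step only, not a full arithmetic implementation:

```python
import numpy as np

x = np.float64(3.141592653589793)

# Split the double into a high part (leading ~24 bits) and a low part
# (the next ~24 bits of the residual).
hi = np.float32(x)
lo = np.float32(x - np.float64(hi))

# The pair (hi, lo) carries roughly twice the precision of hi alone.
err_single = abs(float(x) - float(hi))
err_pair = abs(float(x) - (float(hi) + float(lo)))
print(err_pair < err_single)   # True
```

All arithmetic on such pairs (add, multiply) then runs on the fast float units, at the cost of a few extra float operations per result.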