Question

I'm trying to use the C++ AMP library from F# as a way to do work in parallel on the GPU. However, the results I'm getting seem counterintuitive.

In C++, I created a library with a single function that uses AMP to square every number in an array:

#include <amp.h>
using namespace concurrency;

extern "C" __declspec ( dllexport ) void _stdcall square_array(double* arr, int n)
{
    // Create a view over the data on the CPU
    array_view<double,1> dataView(n, &arr[0]);

    // Run code on the GPU
    parallel_for_each(dataView.extent, [=] (index<1> idx) restrict(amp)
    {
        dataView[idx] = dataView[idx] * dataView[idx];
    });

    // Copy data from GPU to CPU
    dataView.synchronize();
}

(Code adapted from Igor Ostrovsky's blog on MSDN.)

I then wrote the following F# to compare the Task Parallel Library (TPL) with AMP:

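open System
open System.Diagnostics
open System.Runtime.InteropServices
open System.Threading.Tasks

// Print the time needed to run the given function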
let time f =
    let s = new Stopwatch()
    s.Start()
    f ()
    s.Stop()
    printfn "elapsed: %d" s.ElapsedTicks

module CInterop =
    [<DllImport("CPlus", CallingConvention = CallingConvention.StdCall)>]
    extern void square_array(float[] array, int length)

let options = new ParallelOptions()
let size = 1000.0
let arr = [|1.0 .. size|]
// Square the number at the given index of the array
let sq i = arr.[i] <- arr.[i] * arr.[i]
// Square every number in the array using TPL (the upper bound of Parallel.For is exclusive)
time (fun () -> Parallel.For(0, arr.Length, options, Action<int>(sq)) |> ignore)

let arr2 = [|1.0 .. size|]
// Square every number in the array using AMP
time (fun() -> CInterop.square_array(arr2, arr2.Length))

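If I set the array size to a trivial number like 10, TPL takes ~22K ticks to finish and AMP takes ~10K ticks. That is what I expected. As I understand it, the GPU (and hence AMP) should be better suited than TPL to this kind of situation, where the work is split into very small pieces.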

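However, if I increase the array size to 1000, TPL now takes about 30K ticks and AMP about 70K ticks. It only gets worse from there. For an array of 1 million elements, AMP takes nearly 1000 times as long as TPL.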

Since I'd expect the GPU (i.e. AMP) to do better at this kind of task, I'm wondering what I'm missing here.

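My graphics card is a GeForce 550 Ti with 1GB, which as far as I know is no slouch. I know there is overhead in using PInvoke to call into the AMP code, but I'd expect that to be a fixed cost that is amortized over larger array sizes. I believe the array is passed by reference (although I may be wrong), so I don't expect any cost associated with copying.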

Thanks, everyone, for your suggestions.

Answer 1

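Transferring data back and forth between the GPU and the CPU takes time. You are most likely measuring PCI Express bus bandwidth here. Squaring 1M floats is a piece of cake for a GPU.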
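To put a rough number on the transfer cost, here is a back-of-the-envelope sketch in plain C++ (no AMP needed). The bandwidth constant is an assumption, not a measurement: it is the approximate theoretical peak of a PCIe 2.0 x16 link per direction, and sustained rates are lower. The names `round_trip_bytes` and `transfer_ms` are made up for this illustration.

```cpp
#include <cstddef>

// Assumed figure, not a measurement: PCIe 2.0 x16 peaks at roughly
// 8 GB/s per direction in theory; real sustained rates are lower.
constexpr double kAssumedBytesPerSecond = 8.0e9;

// Bytes that cross the bus when `count` doubles are copied to the GPU
// and the squared results are copied back.
constexpr double round_trip_bytes(std::size_t count) {
    return 2.0 * static_cast<double>(count) * sizeof(double);
}

// Lower bound on the raw copy time, in milliseconds, at the assumed bandwidth.
constexpr double transfer_ms(std::size_t count) {
    return round_trip_bytes(count) / kAssumedBytesPerSecond * 1000.0;
}
```

For the 1M-element array from the question, `round_trip_bytes(1000000)` is 16 MB and `transfer_ms(1000000)` comes out at about 2 ms under this assumption. That covers only the raw copy; any fixed per-call overhead in the driver or runtime is not modelled here and would come on top.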

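Using the Stopwatch class to measure AMP performance is also not a good idea, because GPU calls may happen asynchronously. In your case it is fine, but if you measured only the compute part (parallel_for_each), it would not work. I think you can use D3D11 performance counters for that.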
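The asynchronous-call caveat can be illustrated with a CPU-side analogy. The sketch below does not use C++ AMP: it uses `std::async` as a stand-in for a GPU dispatch, and `Timings` and `time_async_work` are hypothetical names chosen for this example. It shows that a timer stopped right after the launching call returns misses most of the work, while a timer stopped after a blocking wait captures it.

```cpp
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

struct Timings {
    double launch_ms;    // time until the launching call returned
    double complete_ms;  // time until the work had actually finished
};

// Stand-in for an asynchronous GPU dispatch: the work runs on another
// thread and takes about `work_ms` milliseconds (hypothetical workload).
inline Timings time_async_work(int work_ms) {
    const auto start = Clock::now();
    std::future<void> pending = std::async(std::launch::async, [work_ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
    });
    const auto launched = Clock::now();  // stopping the timer here misses the work
    pending.get();                       // block until the work is done
    const auto finished = Clock::now();

    const std::chrono::duration<double, std::milli> launch = launched - start;
    const std::chrono::duration<double, std::milli> complete = finished - start;
    return Timings{launch.count(), complete.count()};
}
```

Calling `time_async_work(50)` should report a `launch_ms` well below `complete_ms`. In the question's code, the call to `dataView.synchronize()` plays the blocking role, which is why timing the whole `square_array` call with a Stopwatch still gives a meaningful figure.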
