Question

I was wondering if there is a more efficient way of writing a = a + b + c?

 thrust::transform(b.begin(), b.end(), c.begin(), b.begin(), thrust::plus<int>());
 thrust::transform(a.begin(), a.end(), b.begin(), a.begin(), thrust::plus<int>());

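This works, but is there a way to get the same effect with just one line of code? I looked at the saxpy implementation in the examples, but it uses two vectors and a constant value.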

Is this more efficient?

#include <iostream>
#include <thrust/for_each.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

struct arbitrary_functor
{
    template <typename Tuple>
    __host__ __device__
    void operator()(Tuple t)
    {
        // D[i] = A[i] + B[i] + C[i]; here D aliases A, so A is updated in place
        thrust::get<3>(t) = thrust::get<0>(t) + thrust::get<1>(t) + thrust::get<2>(t);
    }
};


int main(){

    // allocate storage
    thrust::host_vector<int> A;
    thrust::host_vector<int> B;
    thrust::host_vector<int> C;

    // initialize input vectors
    A.push_back(10);
    B.push_back(10);
    C.push_back(10);

    // apply the transformation: A appears twice, once as input and once as output
    thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin(), C.begin(), A.begin())),
                     thrust::make_zip_iterator(thrust::make_tuple(A.end(),   B.end(),   C.end(),   A.end())),
                     arbitrary_functor());

    // print the output
    std::cout << A[0] << std::endl;

    return 0;
}

Answer 1

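a = a + b + c has low arithmetic intensity (only two arithmetic operations for every four memory operations), so the computation will be memory-bandwidth bound. To compare the efficiency of your proposed solutions, we need to measure their bandwidth demands.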

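Each call to transform in the first solution requires two loads and one store for each invocation of plus. So we can model the cost of each transform call as 3N, where N is the size of the vectors a, b, and c. Since there are two calls to transform, the cost of this solution is 6N.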

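We can model the cost of the second solution in the same way. Each invocation of arbitrary_functor requires three loads and one store. So the cost model for this solution is 4N, which implies that the for_each solution should be more efficient than calling transform twice. When N is large, the second solution should run 6N/4N = 1.5x faster than the first.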
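The arithmetic behind this cost model can be written out as a small sketch. The function and constant names below are hypothetical, chosen only for illustration, and the model counts loads and stores per element exactly as described above; it ignores caching, kernel launch overhead, and anything else that affects real timings.

```cpp
#include <cassert>
#include <cstddef>

// Memory operations (loads + stores) per element, as counted in the answer.
constexpr std::size_t kOpsPerBinaryTransform = 3;  // 2 loads + 1 store
constexpr std::size_t kOpsPerFusedForEach    = 4;  // 3 loads + 1 store

// Hypothetical helper: traffic of the two-transform solution, 2 * 3N = 6N.
std::size_t two_transform_traffic(std::size_t n)
{
    return 2 * kOpsPerBinaryTransform * n;
}

// Hypothetical helper: traffic of the single for_each solution, 4N.
std::size_t fused_traffic(std::size_t n)
{
    return kOpsPerFusedForEach * n;
}

// Ratio of the two traffic estimates; 6N / 4N = 1.5 for any n > 0.
double expected_speedup(std::size_t n)
{
    assert(n > 0);
    return static_cast<double>(two_transform_traffic(n)) /
           static_cast<double>(fused_traffic(n));
}
```

Because the ratio does not depend on N, the 1.5x figure is an upper bound set by memory traffic alone; for small N, fixed overheads are likely to dominate and the measured gap may be smaller.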

Of course, you could always combine zip_iterator with transform in the same way to avoid two separate calls to transform.
