Question

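Please note that I am an absolute beginner with CUDA, and everything below is untested pseudocode. I'm coming from JavaScript, and my C++ is very rusty as well, so I apologize for my ignorance :)

I am trying to use CUDA to backtest many different forex strategies.

Using Thrust, I have 1000 objects instantiated from a class (pseudocode):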

#include <stdio.h>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>

#define N 1000

typedef struct dataPoint {
    ...
} dataPoint;

class Strategy {
    public:
        __device__ __host__ Strategy(...) {
            ...
        }

        __device__ __host__ void backtest(dataPoint data) {
            ...
        }
};

int main() {
    dataPoint data[100000];
    thrust::device_ptr<Strategy> strategies[1000];
    int i;

    // Instantiate 1000 strategies.
    for (i=0; i<1000; i++) {
        strategies[i] = thrust::device_new<Strategy>(...);
    }

    // Iterate over all 100000 data points.
    for (i=0; i<100000; i++) {
        // Somehow run .backtest(data[i]) on each strategy here.
        // i.e. Run backtest() in parallel for all 1000
        // strategy objects here.
    }
}

Now let's say I want to run the .backtest() method on each object for every item in data. Procedurally, I would do the following:

// Iterate over all 100000 data points.
for (j=0; j<100000; j++) {
    // Iterate over all 1000 strategies.
    for (i=0; i<1000; i++) {
        strategies[i].backtest(data[j]);
    }
}

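How can I accomplish this with CUDA, so that .backtest() runs in parallel for all strategies on each iteration j through the data?

If I have to completely rearchitect everything, so be it - I'm open to whatever is necessary. If this isn't possible with classes, then so be it.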

Answer 1

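Typical Thrust code makes frequent use of certain C++ idioms (e.g. functors), so if your C++ is rusty, you may want to read up on C++ functors. You may also want to review the Thrust quick start guide for a discussion of functors as well as the fancy iterators we will use shortly.

In general, at least from an expression standpoint, I think Thrust is well suited to your problem description. Given how flexibly Thrust can express these kinds of problems, there are probably many ways to skin the cat. I'll try to present something that is "close" to your pseudocode, but there are undoubtedly many ways to realize this.

First, in Thrust we generally try to avoid for-loops. These will be quite slow, because they usually involve interaction between host code and device code at every iteration (e.g. launching CUDA kernels at each iteration). Where possible we prefer Thrust algorithms, since these generally "translate" into one or a small number of CUDA kernels under the hood.

One of the most basic algorithms in Thrust is transform. It comes in a variety of flavors, but basically it takes input data and applies an arbitrary operation to it, element by element.

Using basic Thrust transform operations, we can initialize your data as well as your strategies without resorting to a for-loop. We will construct a device vector of appropriate length for each type of object (dataPoint, Strategy), and then use thrust::transform to initialize each vector.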
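As a host-only illustration of this pattern (not Thrust itself), the sketch below uses std::transform from the C++ standard library, whose calling convention thrust::transform mirrors: an input range, an output iterator, and a functor. The helper name make_data and the use of std::iota in place of thrust::counting_iterator are assumptions made for illustration; the dataPoint fields and the initialization formula follow the assumptions stated further on in this answer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct dataPoint {
    int dtype;
    float dval;
};

// Functor that builds one dataPoint from its index, as a Thrust functor would.
struct data_init {
    dataPoint operator()(std::size_t id) const {
        dataPoint temp;
        temp.dtype = static_cast<int>(id);
        temp.dval = 1.0f + static_cast<float>(id) * 10.0f;
        return temp;
    }
};

// Hypothetical helper: fills a vector of n dataPoints without a hand-written loop.
std::vector<dataPoint> make_data(std::size_t n) {
    std::vector<std::size_t> ids(n);
    std::iota(ids.begin(), ids.end(), std::size_t{0});  // 0, 1, ..., n-1
    std::vector<dataPoint> data(n);
    std::transform(ids.begin(), ids.end(), data.begin(), data_init());
    return data;
}
```

In the Thrust version the index sequence never has to be stored, because a counting iterator generates it on the fly.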

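This leaves us with the task of executing each Strategy against each dataPoint. Ideally we would like to do this in parallel as well; not just across the strategies within each for-loop iteration, as you suggest, but for every Strategy against every dataPoint, all "at once" (i.e. in a single algorithm call).

In effect, we can think of a matrix with one axis made up of dataPoint objects (of dimension 100000 in your example) and the other axis made up of Strategy objects (of dimension 1000 in your example). Each point in this matrix holds the result of applying that Strategy to that dataPoint.

In Thrust, we usually prefer to realize such 2D concepts as one-dimensional. Our result space is therefore equal to the number of dataPoint objects times the number of Strategy objects. We'll create a result device_vector of this size (100000 * 1000 in your example) to hold the results.

For the sake of demonstration, since you've given little guidance on the kind of arithmetic you want to do, we'll assume the following: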

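1. The result of applying a dataPoint against a Strategy is a float.
2. A dataPoint is a struct consisting of an int (dtype, ignored in this example) and a float (dval). For dataPoint(i), dval will contain 1.0f + i*10.0f.
3. A Strategy consists of a multiplier and an adder, to be used as follows:
```
Strategy(i) = multiplier(i) * dval + adder(i);
```
4. Applying a dataPoint against a Strategy consists of retrieving the dval associated with the dataPoint and substituting it into the equation given in item 3 above. This equation is captured in the backtest method of the class Strategy. The backtest method takes an object of type dataPoint as its argument, from which it retrieves the appropriate dval.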
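To make the arithmetic concrete, here is a small host-only sketch of these assumptions. The initial values for multiplier and adder (1 + i/N and i) are the ones used in the sample code in this answer; the free function make_point is a hypothetical helper added for illustration.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t N = 1000;  // number of strategies, as in the question

struct dataPoint {
    int dtype;   // ignored in this example
    float dval;
};

class Strategy {
    float multiplier;
    float adder;
public:
    explicit Strategy(std::size_t id)
        : multiplier(1.0f + static_cast<float>(id) / static_cast<float>(N)),
          adder(static_cast<float>(id)) {
        assert(id < N);
    }
    // Strategy(i) = multiplier(i) * dval + adder(i)
    float backtest(dataPoint data) const {
        return multiplier * data.dval + adder;
    }
};

// Hypothetical helper: dataPoint(i) carries dval = 1.0f + i * 10.0f.
dataPoint make_point(std::size_t i) {
    dataPoint p;
    p.dtype = static_cast<int>(i);
    p.dval = 1.0f + static_cast<float>(i) * 10.0f;
    return p;
}
```

For example, Strategy(2).backtest(make_point(3)) evaluates 1.002 * 31 + 2, which is roughly 33.062.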

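There are a few more concepts we need to cover. The 1D realization of the 2D result matrix requires an appropriate indexing scheme, so that for each point in the 2D matrix, given its linear index, we can determine which Strategy and which dataPoint will be used to compute the result at that point. In Thrust, we can do this with a combination of fancy iterators.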
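The index arithmetic itself is simple and can be checked on the host. The sketch below assumes a row-major layout in which each Strategy owns one row of consecutive results, one per dataPoint, which is the layout the sample code in this answer uses; the function names are chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>

typedef std::size_t idx_t;

// Row (Strategy) index for a linear index into a result array with dsize columns.
idx_t strat_index(idx_t linear, idx_t dsize) {
    assert(dsize > 0);
    return linear / dsize;
}

// Column (dataPoint) index for the same linear index.
idx_t data_index(idx_t linear, idx_t dsize) {
    assert(dsize > 0);
    return linear % dsize;
}

// Inverse mapping: linear position of the pair (strategy, data point).
idx_t linear_index(idx_t strat, idx_t point, idx_t dsize) {
    assert(point < dsize);
    return strat * dsize + point;
}
```

With dsize = 100000, linear index 250007 maps to Strategy 2 and dataPoint 50007.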

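In a nutshell, working from the "inside out", we start with a transform iterator that takes an index-mapping functor and the linear sequence provided by a thrust::counting_iterator, in order to create a map for each index (one per matrix dimension). The arithmetic in each mapping functor converts the linear index of the result into the appropriately repeated index for the rows or the columns of the matrix. Given this transform iterator, which produces the repeated row or column index, we pass that index to a permutation iterator, which selects the appropriate Strategy or dataPoint for each row/column indicated. These two items (Strategy, dataPoint) are then zipped together in a zip_iterator. The zip_iterator is then passed to the run_strat functor, which actually computes the given Strategy applied against the given dataPoint.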
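It may help to see what this iterator stack computes when written as a plain sequential loop on the host. This is only a reference sketch of the same result layout, not how Thrust executes it; the stripped-down Strategy and dataPoint definitions and the function name run_all are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct dataPoint { float dval; };

struct Strategy {
    float multiplier;
    float adder;
    float backtest(dataPoint d) const { return multiplier * d.dval + adder; }
};

// For each linear index k, pick the Strategy and dataPoint that the permutation
// iterators would select, pair them (the zip step), and apply backtest.
std::vector<float> run_all(const std::vector<Strategy>& strategies,
                           const std::vector<dataPoint>& data) {
    assert(!data.empty());
    const std::size_t dsize = data.size();
    std::vector<float> result(strategies.size() * dsize);
    for (std::size_t k = 0; k < result.size(); ++k) {
        const Strategy& s = strategies[k / dsize];  // row index, then gather
        const dataPoint& d = data[k % dsize];       // column index, then gather
        result[k] = s.backtest(d);                  // functor applied to the pair
    }
    return result;
}
```

In the Thrust realization, every iteration of this loop becomes an independent element of a single transform call, so no loop appears in the user code.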

Here is a sample code outlining the above concepts:

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <math.h>

#define TOL 0.00001f

// number of strategies
#define N 1000
// number of data points
#define DSIZE 100000

// could use int instead of size_t here, for these problem dimensions
typedef size_t idx_t;

struct dataPoint {
  int dtype;
  float dval;
};

class Strategy {
    float multiplier;
    float adder;
    idx_t id;

  public:
    __device__ __host__ Strategy(){
      id = 0;
      multiplier = 0.0f;
      adder = 0.0f;
    }
    __device__ __host__ Strategy(idx_t _id) {
      id = _id;
      multiplier = 1.0f + ((float)id)/(float)N;
      adder = (float)id;
    }
    __device__ __host__ float backtest(dataPoint data) {
      return multiplier*data.dval+adder;
    }
};

// functor to initialize dataPoint
struct data_init {
  __host__ __device__ dataPoint operator()(idx_t id){
    dataPoint temp;
    temp.dtype = id;
    temp.dval = 1.0f + id * 10.0f;
    return temp;
  }
};

// functor to initialize Strategy
struct strat_init {
  __host__ __device__ Strategy operator()(idx_t id){
    Strategy temp(id);
    return temp;
  }
};

// functor to "test" a Strategy against a dataPoint, using backtest method
// (the first argument, the linear index, is unused)
struct run_strat {
  template <typename T>
  __host__ __device__ float operator()(idx_t id, T t){
    return (thrust::get<0>(t)).backtest(thrust::get<1>(t));
  }
};

// mapping functor to generate "row" (Strategy) index from linear index
struct strat_mapper : public thrust::unary_function<idx_t, idx_t> {
  __host__ __device__ idx_t operator()(idx_t id){
    return id/DSIZE;
  }
};

// mapping functor to generate "column" (dataPoint) index from linear index
struct data_mapper : public thrust::unary_function<idx_t, idx_t> {
  __host__ __device__ idx_t operator()(idx_t id){
    return id%DSIZE;
  }
};

int main() {
  // initialize data
  thrust::device_vector<dataPoint> data(DSIZE);
  thrust::transform(thrust::counting_iterator<idx_t>(0),
                    thrust::counting_iterator<idx_t>(DSIZE),
                    data.begin(), data_init());

  // initialize strategies
  thrust::device_vector<Strategy> strategies(N);
  thrust::transform(thrust::counting_iterator<idx_t>(0),
                    thrust::counting_iterator<idx_t>(N),
                    strategies.begin(), strat_init());

  // allocate space for results for each datapoint against each strategy
  thrust::device_vector<float> result(DSIZE*N);

  // test each data point against each strategy, in a single algorithm call
  thrust::transform(
    thrust::counting_iterator<idx_t>(0),
    thrust::counting_iterator<idx_t>(DSIZE*N),
    thrust::make_zip_iterator(thrust::make_tuple(
      thrust::make_permutation_iterator(
        strategies.begin(),
        thrust::make_transform_iterator(thrust::counting_iterator<idx_t>(0), strat_mapper())),
      thrust::make_permutation_iterator(
        data.begin(),
        thrust::make_transform_iterator(thrust::counting_iterator<idx_t>(0), data_mapper())))),
    result.begin(),
    run_strat());

  // validation
  // this would have to be changed if you change initialization of dataPoint
  // or Strategy
  thrust::host_vector<float> h_result = result;
  for (int j = 0; j < N; j++){
    float m = 1.0f + (float)j/(float)N;
    float a = j;
    for (int i = 0; i < DSIZE; i++){
      float d = 1.0f + i*10.0f;
      if (fabsf(h_result[j*DSIZE+i] - (m*d+a))/(m*d+a) > TOL) {
        std::cout << "mismatch at: " << i << "," << j << " was: " << h_result[j*DSIZE+i] << " should be: " << m*d+a << std::endl;
        return 1;
      }
    }
  }
  return 0;
}

Notes:

1. As mentioned above, this is one possible realization. I think it should be "reasonably" efficient, but there may be more efficient realizations in Thrust. A more complete analysis of your actual strategies and backtest methods would probably be needed before trying to tackle optimizations.
2. The final transform operation uses a counting_iterator as its first argument (and second argument), but this is effectively ignored; it is a "dummy" usage, just to size the problem appropriately. It could be eliminated with a simpler realization, but in my view the easiest way to do that (without cluttering the code further) would be to use C++11 auto to define the zip_iterator, then pass it by itself, plus an offset version of it, to thrust::transform, using the version that takes just one input vector instead of two. I don't think this should make much difference in performance, and I felt the version shown here was slightly easier to parse, but perhaps not.

How can I call a kernel function for multiple Thrust object member functions in CUDA?
