Question

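In an attempt to write a more usable version of the code for my answer to another question, I used a lambda function to process an individual unit. This is a work in progress. I've got the "client" syntax looking pretty nice: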

// for loop split into 4 threads, calling doThing for each index
parloop(4, 0, 100000000, [](int i) { doThing(i); });

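However, I have an issue. Whenever I call the saved lambda, it takes up a ton of CPU time. doThing itself is an empty stub. If I just comment out the internal call to the lambda, the speed returns to normal (a 4x speedup for 4 threads). I'm using std::function to save the reference to the lambda.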

My question is: is there some better way, which I haven't come across, that the STL manages lambdas internally for large sets of data?

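#include <algorithm>
#include <chrono>
#include <cmath>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

using std::cin;
using std::cout;

struct parloop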
{
public:
    std::vector<std::thread> myThreads;
    int numThreads, rangeStart, rangeEnd;
    std::function<void (int)> lambda;

    parloop(int _numThreads, int _rangeStart, int _rangeEnd, std::function<void(int)> _lambda) //
        : numThreads(_numThreads), rangeStart(_rangeStart), rangeEnd(_rangeEnd), lambda(_lambda) //
    {
        init();
        exit();
    }

    void init()
    {
        myThreads.resize(numThreads);

        for (int i = 0; i < numThreads; ++i)
        {
            myThreads[i] = std::thread(myThreadFunction, this, chunkStart(i), chunkEnd(i));
        }
    }

    void exit()
    {
        for (int i = 0; i < numThreads; ++i)
        {
            myThreads[i].join();
        }
    }

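    // chunk size, rounded up; integer math avoids float rounding on large ranges
    int rangeJump()
    {
        return (rangeEnd - rangeStart + numThreads - 1) / numThreads;
    }

    int chunkStart(int i)
    {
        return rangeStart + rangeJump() * i;
    }

    // inclusive last index of chunk i; rangeEnd itself is excluded
    int chunkEnd(int i)
    {
        return std::min(rangeStart + rangeJump() * (i + 1), rangeEnd) - 1;
    }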

    static void myThreadFunction(parloop *self, int start, int end) //
    {
        std::function<void(int)> lambda = self->lambda;
        // loop through this thread's chunk and invoke the stored lambda for each index
        for (int i = start; i <= end; ++i)
        {
            lambda(i); // commenting this out speeds things up back to normal
        }
    }

};

void doThing(int i) // "payload" of the lambda function
{
}

int main()
{
    std::chrono::high_resolution_clock timer;
    auto start = timer.now();
    auto stop = timer.now();


    // run 4 trials of each number of threads
    for (int x = 1; x <= 4; ++x)
    {
        // test between 1-8 threads
        for (int numThreads = 1; numThreads <= 8; ++numThreads)
        {
            start = timer.now();

            // this is the line of code which calls doThing in the loop

            parloop(numThreads, 0, 100000000, [](int i) { doThing(i); });

            stop = timer.now();

            cout << numThreads << " Time = " << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() / 1000000.0f << " ms\n";
            //cout << "\t\tsimple list, time was " << deltaTime2 / 1000000.0f << " ms\n";
        }
    }

    cin.ignore();
    cin.get();
    return 0;
}

Answer 1

I'm using std::function to save the reference to the lambda.

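This is a possible problem, as std::function is not a zero-runtime-cost abstraction. It is a type-erased wrapper that has a virtual-call-like cost when invoking operator(), and it may also heap-allocate its target (which could mean a cache miss per call).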
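To make that cost concrete, here is a minimal sketch of how a type-erased callable can be built. ErasedCallback, runErased and runDirect are hypothetical names used only for illustration; real std::function implementations differ in detail (for example, many use a small-buffer optimization to avoid allocating for small callables).

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical, simplified stand-in for std::function<void(int)>.
class ErasedCallback
{
    // Abstract interface that hides the concrete callable type.
    struct Concept
    {
        virtual ~Concept() = default;
        virtual void call(int i) = 0;
    };

    // Holds one concrete callable of type F behind the interface.
    template <typename F>
    struct Model final : Concept
    {
        F f;
        explicit Model(F fn) : f(std::move(fn)) {}
        void call(int i) override { f(i); }
    };

    std::unique_ptr<Concept> impl_; // the callable lives on the heap

public:
    template <typename F>
    ErasedCallback(F f) : impl_(std::make_unique<Model<F>>(std::move(f))) {}

    void operator()(int i)
    {
        assert(impl_ != nullptr);
        impl_->call(i); // virtual dispatch through a pointer on every call
    }
};

// The loop only sees ErasedCallback, so each iteration is an indirect call.
inline void runErased(ErasedCallback& cb, int n)
{
    for (int i = 0; i < n; ++i) cb(i);
}

// The loop sees the concrete type F, so the compiler is free to inline f(i).
template <typename F>
void runDirect(F& f, int n)
{
    for (int i = 0; i < n; ++i) f(i);
}
```

Both loops do the same work; the difference is only in how much the compiler knows about the callee at the call site.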

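If you want to store a lambda in a way that does not introduce additional overhead and that allows the compiler to inline it, you should use a template parameter. This is not always possible, but it might fit your use case. Example: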

template <typename TFunction>
struct parloop
{
public:
    std::vector<std::thread> myThreads;
    int numThreads, rangeStart, rangeEnd;
    TFunction lambda;

    // initializers are listed in member declaration order
    parloop(TFunction&& _lambda,
            int _numThreads, int _rangeStart, int _rangeEnd)
        : numThreads(_numThreads), rangeStart(_rangeStart),
          rangeEnd(_rangeEnd), lambda(std::move(_lambda))
    {
        init();
        exit();
    }

// ...

To deduce the type of the lambda, you can use a helper function:

template <typename TF, typename... TArgs>
auto make_parloop(TF&& lambda, TArgs&&... xs)
{
    return parloop<std::decay_t<TF>>(
        std::forward<TF>(lambda), std::forward<TArgs>(xs)...);
}

Usage:

auto p = make_parloop([](int i) { doThing(i); }, 
                      numThreads, 0, 100000000);
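If a C++17 compiler is available (an assumption; the helper function above also works in C++14), class template argument deduction can deduce the lambda's type directly from a constructor call. The sketch below uses a hypothetical LoopBody class rather than parloop itself, just to show the deduction.

```cpp
#include <cassert>
#include <utility>

// Hypothetical class used only to illustrate C++17 deduction.
template <typename TFunction>
struct LoopBody
{
    TFunction f;

    explicit LoopBody(TFunction fn) : f(std::move(fn)) {}

    // Calls f(i) for every i in [begin, end).
    void run(int begin, int end)
    {
        assert(begin <= end);
        for (int i = begin; i < end; ++i) f(i);
    }
};

// Usage: TFunction is deduced from the lambda, so no helper is needed.
inline int sumBelow(int n)
{
    int total = 0;
    LoopBody body([&total](int i) { total += i; });
    body.run(0, n);
    return total;
}
```

The stored callable keeps its concrete type here, just as with the helper function, so the call inside run remains visible to the optimizer.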

I wrote an article that's related to the subject:
"Passing functions to functions"

It contains some benchmarks that show how much assembly is generated for std::function compared to a template parameter and other solutions.
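For reference, the template-parameter approach can also be sketched as a single free function. parallel_for is a hypothetical name, the range is assumed to be half-open ([rangeStart, rangeEnd)), and each thread works on its own copy of the callable; this is one possible shape, not the original parloop.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: splits [rangeStart, rangeEnd) into numThreads chunks
// and calls f(i) for every index, one chunk per thread.
template <typename TFunction>
void parallel_for(int numThreads, int rangeStart, int rangeEnd, TFunction f)
{
    assert(numThreads > 0 && rangeStart <= rangeEnd);

    // Chunk size, rounded up so the whole range is covered.
    const int chunk = (rangeEnd - rangeStart + numThreads - 1) / numThreads;

    std::vector<std::thread> threads;
    threads.reserve(numThreads);

    for (int t = 0; t < numThreads; ++t)
    {
        // Clamp both ends so surplus threads receive an empty chunk.
        const int begin = std::min(rangeStart + t * chunk, rangeEnd);
        const int end = std::min(begin + chunk, rangeEnd);

        // f is captured by value with its concrete type, so the call in
        // the loop is a direct call that the compiler may inline.
        threads.emplace_back([f, begin, end]() mutable {
            for (int i = begin; i < end; ++i) f(i);
        });
    }

    for (auto& th : threads) th.join();
}
```

With this shape the call site stays close to the one in the question, for example parallel_for(4, 0, 100000000, [](int i) { doThing(i); });.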

Stored lambda function calls are very slow - fix or workaround?
