Question

This problem is easier to explain through a simplified example (my real situation is far from "minimal"): given a...

template <typename T>
void post_in_thread_pool(T&& f)

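...function template, I want to create a parallel asynchronous algorithm with a tree-like recursive structure. I will write an example of that structure below, using std::count_if as a placeholder. The strategy is as follows: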

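- If the length of the range I am checking is less than 64, I fall back to the sequential std::count_if function. (0)
- If it is greater than or equal to 64, I spawn a job in the thread pool that recurses on the left half of the range, and compute the right half on the current thread. (1)
  - I use a shared atomic int to "wait" for both halves to be computed. (2)
  - I use a shared atomic int to accumulate the partial results. (3)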

Simplified code:

void async_count_if(auto begin, auto end, auto predicate, auto continuation)
{
    const auto sz = std::distance(begin, end);

    // (0) Base case:
    if(sz < 64)
    {
        continuation(std::count_if(begin, end, predicate));
        return;
    }

    // (1) Recursive case:
    auto counter = std::make_shared<std::atomic<int>>(2); // (2)
    // cleanup gets a new closure type at every recursion level
    auto cleanup = [=, accumulator = std::make_shared<std::atomic<int>>(0) /*(3)*/]
                   (int partial_result)
    {
        *accumulator += partial_result;

        if(--*counter == 0)
        {
            continuation(*accumulator);
        }
    };

    const auto mid = std::next(begin, sz / 2);

    post_in_thread_pool([=]
    {
        async_count_if(begin, mid, predicate, cleanup);
    });

    async_count_if(mid, end, predicate, cleanup);
}

The code can then be used as follows:

std::vector<int> v(512);
std::iota(std::begin(v), std::end(v), 0);

async_count_if(std::begin(v), std::end(v),
/*    predicate */ [](auto x){ return x < 256; }, 
/* continuation */ [](auto res){ std::cout << res << std::endl; });

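The problem in the code above is auto cleanup. Since auto is deduced to a unique type for every instantiation of the cleanup lambda, and since cleanup captures continuation by value, the recursion makes the compiler try to compute an infinitely nested lambda type, which results in the following error:

fatal error: recursive template instantiation exceeded maximum depth of 1024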

wandbox example

Conceptually, you can think of the type being built up roughly like this:

cont                                // user-provided continuation
cleanup0<cont>                      // recursive step 0
cleanup1<cleanup0<cont>>            // recursive step 1
cleanup2<cleanup1<cleanup0<cont>>>  // recursive step 2
// ...
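
The nesting can be reproduced in isolation. The following is a minimal sketch, not the code from the question: wrap and nesting_demo are hypothetical names, and wrap plays the role of one recursion step that captures the previous continuation by value. Each application produces a distinct closure type that contains the previous one.

```cpp
#include <type_traits>

// Hypothetical helper: one "recursion step" that captures the previous
// continuation by value, like cleanup does in the question.
template <class Cont>
auto wrap(Cont cont) {
    return [cont](int x) { cont(x + 1); };
}

// Hypothetical demo function: builds two nested steps and runs them.
inline int nesting_demo() {
    int seen = 0;
    auto cont  = [&seen](int x) { seen = x; };  // user-provided continuation
    auto step0 = wrap(cont);                    // cleanup0<cont>
    auto step1 = wrap(step0);                   // cleanup1<cleanup0<cont>>

    // Every level is a different type, so a template that recurses while
    // wrapping its own continuation never reaches an already-seen type.
    static_assert(!std::is_same_v<decltype(step0), decltype(step1)>,
                  "each wrapping level has its own closure type");
    static_assert(sizeof(step1) >= sizeof(cont),
                  "the outer closure stores the inner ones by value");

    step1(0);     // step1 -> step0 -> cont, adding 1 at each wrapper
    return seen;
}
```

Because the recursion depth depends on a run-time value (the range size), the compiler cannot know where the chain of types should stop, so it keeps instantiating until it hits its depth limit.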

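(!): Note that async_count_if is just an example to show the "tree-like" recursive structure of my real situation. I am aware that an asynchronous count_if can be trivially implemented with a single atomic counter and sz / 64 tasks.

I want to avoid the error while minimizing any run-time or memory overhead.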

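- One possible solution is to use std::function<void(int)> cleanup, which allows the code to compile and run correctly, but produces suboptimal assembly and introduces extra dynamic allocations. wandbox example
- Another possible solution is to use a std::size_t template parameter plus specialization to artificially limit the recursion depth of async_count_if. Unfortunately this can bloat the binary size and is quite inelegant.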
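
For reference, the first workaround can be sketched as follows. This is only an illustration under stated assumptions: post_in_thread_pool is replaced by a hypothetical stand-in that runs the job inline (a real pool would run it on another thread), and the continuation parameter is type-erased as std::function<void(int)>, so every recursion level sees the same type.

```cpp
#include <algorithm>
#include <atomic>
#include <functional>
#include <iterator>
#include <memory>
#include <utility>

// Hypothetical stand-in for the real thread pool: runs the job inline.
template <class F>
void post_in_thread_pool(F&& f) { std::forward<F>(f)(); }

template <class It, class Pred>
void async_count_if(It begin, It end, Pred predicate,
                    std::function<void(int)> continuation) {
    const auto sz = std::distance(begin, end);

    // (0) Base case: count sequentially.
    if (sz < 64) {
        continuation(static_cast<int>(std::count_if(begin, end, predicate)));
        return;
    }

    // (1) Recursive case: the closure is erased into std::function, so the
    // type of cleanup no longer depends on the type of continuation.
    auto counter     = std::make_shared<std::atomic<int>>(2);  // (2)
    auto accumulator = std::make_shared<std::atomic<int>>(0);  // (3)
    std::function<void(int)> cleanup =
        [counter, accumulator, continuation](int partial_result) {
            *accumulator += partial_result;
            if (--*counter == 0) continuation(*accumulator);
        };

    const It mid = std::next(begin, sz / 2);
    post_in_thread_pool([=] { async_count_if(begin, mid, predicate, cleanup); });
    async_count_if(mid, end, predicate, cleanup);
}
```

The cost mentioned above is visible here: every recursion level constructs a std::function (which may allocate) and two shared_ptr control blocks, and every call to the continuation goes through an indirect call.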

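What bothers me is that I know the size of the range when I invoke async_count_if: it is std::distance(begin, end). Knowing the size of the range, I can also deduce the number of counters and continuations required: (2^k - 1), where k is the depth of the recursion tree.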

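Therefore, I think there should be a way of precomputing a "control structure" in the first invocation of async_count_if and passing it to the recursive invocations by reference. This "control structure" could contain enough space for (2^k - 1) atomic counters and (2^k - 1) cleanup/continuation functions.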
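
One way such a control structure might look is sketched below. It is an assumption-laden illustration, not a known solution from this thread: Node, Control, finish, count_node and kCutoff are hypothetical names, post_in_thread_pool is a stand-in that runs jobs inline, and the recursion tree is stored in heap layout (the children of node i are 2i + 1 and 2i + 2). Continuations are replaced by node indices, so no closure type ever nests.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real thread pool: runs the job inline.
template <class F>
void post_in_thread_pool(F&& f) { std::forward<F>(f)(); }

constexpr std::ptrdiff_t kCutoff = 64;  // base-case threshold from the question

struct Node {
    std::atomic<int> pending{2};  // (2) halves still running
    std::atomic<int> sum{0};      // (3) accumulated partial results
};

template <class Final>
struct Control {
    std::vector<Node> nodes;  // one entry per internal node: 2^k - 1 in total
    Final final_cont;

    Control(std::size_t internal_nodes, Final f)
        : nodes(internal_nodes), final_cont(std::move(f)) {}

    // Report that the subtree rooted at node i produced `value`.
    void finish(std::size_t i, int value) {
        while (i != 0) {
            Node& parent = nodes[(i - 1) / 2];
            parent.sum += value;
            if (--parent.pending != 0) return;  // sibling has not finished yet
            value = parent.sum;                 // both halves done: move up
            i = (i - 1) / 2;
        }
        final_cont(value);  // reached the root
    }
};

template <class It, class Pred, class Final>
void count_node(Control<Final>& ctrl, std::size_t i, It begin, It end, Pred pred) {
    const auto sz = std::distance(begin, end);
    if (sz < kCutoff) {  // (0) base case
        ctrl.finish(i, static_cast<int>(std::count_if(begin, end, pred)));
        return;
    }
    const It mid = std::next(begin, sz / 2);  // (1) recursive case
    post_in_thread_pool([&ctrl, i, begin, mid, pred] {
        count_node(ctrl, 2 * i + 1, begin, mid, pred);
    });
    count_node(ctrl, 2 * i + 2, mid, end, pred);
}

template <class It, class Pred, class Final>
void async_count_if(It begin, It end, Pred pred, Final final_cont) {
    // Depth k of the tree, following the larger (right) half at each level.
    std::size_t depth = 0;
    for (auto sz = std::distance(begin, end); sz >= kCutoff; sz -= sz / 2) ++depth;

    Control<Final> ctrl((std::size_t{1} << depth) - 1, std::move(final_cont));
    count_node(ctrl, 0, begin, end, pred);
    // With the inline stand-in, every job is done here. With a real pool,
    // ctrl would have to outlive all posted jobs (for example via shared_ptr).
}
```

In this sketch the only allocation is the single vector of nodes, and only one instantiation of count_node exists per iterator/predicate/continuation combination. Storing (2^k - 1) distinct continuation objects, as the question suggests, is not needed here because every internal node performs the same "accumulate and propagate" step; a real algorithm with different per-level continuations would need more than this.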

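Sadly, I was unable to find a clean way to implement this, and decided to post a question here, since this issue seems like it should be common when developing asynchronous parallel recursive algorithms.

What is an elegant way of dealing with this problem without introducing unnecessary overhead?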

Answer 1

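I must be missing something obvious, but why do you need multiple counters and structures? You can precompute the total count of iterations (if you know your base case) and share it, along with an accumulator, throughout all iterations, along these lines (I had to modify your simplified code slightly):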

#include <algorithm>
#include <atomic>
#include <future>
#include <iostream>
#include <iterator>
#include <memory>
#include <numeric>
#include <vector>

using namespace std;

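template <class T>
void post_in_thread_pool(T&& work)
{
    // Stand-in for a real pool. The temporary future returned by std::async
    // blocks in its destructor, so each job finishes before this returns.
    std::async(std::launch::async, work);
}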

template <class It, class Pred, class Cont>
auto async_count_if(It begin, It end, Pred predicate, Cont continuation)
{
    // (0) Base case:  
    if(end - begin <= 64)
    {
        continuation(std::count_if(begin, end, predicate));
        return;
    }

    const auto sz = std::distance(begin, end);
    const auto mid = std::next(begin, sz / 2);                

    post_in_thread_pool([=]
    {
         async_count_if(begin, mid, predicate, continuation);
    });

    async_count_if(mid, end, predicate, continuation);
}

template <class It, class Pred, class Cont>
auto async_count_if_facade(It begin, It end, Pred predicate, Cont continuation)
{
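    // Shared state: one counter for all base-case ranges, one accumulator.
    const auto sz = std::distance(begin, end);
    // sz / 64 is the exact leaf count only when sz is 64 * 2^k; fix this for other sizes
    auto counter = make_shared<atomic<int>>(sz / 64);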
    auto cleanup = [=, accumulator = make_shared<atomic<int>>(0) /*(3)*/]
                   (int partial_result)
    {
        *accumulator += partial_result; 

        if(--*counter == 0)
        {
            continuation(*accumulator);
        }
    };

    return async_count_if(begin, end, predicate, cleanup);
}

int main ()
{
    std::vector<int> v(1024);
    std::iota(std::begin(v), std::end(v), 0);

    async_count_if_facade(std::begin(v), std::end(v), 
    /*    predicate */ [](auto x){ return x > 1000; }, 
    /* continuation */ [](const auto& res){ std::cout << res << std::endl; });
}

A demo
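
The counter in the facade has to equal the number of base-case ranges the recursion actually produces, and sz / 64 only matches that when sz is 64 times a power of two. A minimal sketch of an exact count, assuming the same halving rule and the same "<= 64" base case as the code above (count_leaves is a hypothetical helper, not part of the original answer, and it assumes cutoff >= 1):

```cpp
#include <cstddef>

// Hypothetical helper: number of base-case ranges produced when a range of
// size sz is split at sz / 2 until a piece has size <= cutoff.
constexpr std::ptrdiff_t count_leaves(std::ptrdiff_t sz, std::ptrdiff_t cutoff = 64) {
    if (sz <= cutoff) return 1;          // this piece is handled sequentially
    const std::ptrdiff_t left = sz / 2;  // same split as std::next(begin, sz / 2)
    return count_leaves(left, cutoff) + count_leaves(sz - left, cutoff);
}

static_assert(count_leaves(1024) == 16, "matches sz / 64 for 64 * 2^k");
static_assert(count_leaves(192) == 4, "sz / 64 would give 3 here");
```

With this helper the facade could initialize its counter with count_leaves(sz) instead of sz / 64. The helper visits every leaf once, so for very large ranges a closed-form or memoized version may be preferable.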

Answer 2

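Synchronization using atomic integers is shared mutable state. Shared mutable state kills performance in parallel algorithms. Your shared state is shared across every single thread.

Don't do that.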

template<class T>
auto sink_into_pointer( T* target ) {
  return [target](T x){*target=x;};
}
template<class T>
auto sink_into_promise( std::promise<T>& p ) {
  return [&p](T x){p.set_value(x);};
}
void async_count_if(auto begin, auto end, auto predicate, auto continuation) {
  const auto sz = std::distance(begin, end);

  // (0) Base case:
  if(sz < 64)
  {
    continuation(std::count_if(begin, end, std::move(predicate)));
    return;
  }

  std::promise< int > sub_count;
  auto sub_count_value = sub_count.get_future();

  // Holds a reference to sub_count; safe because this function
  // waits on the future below before sub_count goes out of scope.
  auto sub_count_task = sink_into_promise(sub_count);
  // (1) Recursive case:
  const auto mid = std::next(begin, sz / 2);

  post_in_thread_pool(
    [begin, mid, predicate, sub_count_task]()mutable
    {
      async_count_if(begin, mid, predicate, sub_count_task);
    }
  );

  int second_half = 0;
  auto second_sub_count = sink_into_pointer(&second_half);

  async_count_if(mid, end, predicate, second_sub_count);

  continuation( second_half + sub_count_value.get() );
}

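In this case, the only state shared between threads is what passes through the std::promise (the original wording says packaged_task) and the value returned from the thread pool manager.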

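When writing parallel code, your goal should be to maximize parallelism, not to maximize speed within a given thread. Contention over shared resources and the like will cause worse scaling problems than one function pointer call per thread.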

Answer 3

You can solve the template recursion problem with something like this:

#include <algorithm>
#include <atomic>
#include <future>
#include <iostream>
#include <iterator>
#include <memory>
#include <numeric>
#include <vector>

using namespace std;

template <class T> auto post_in_thread_pool(T &&work) {
  std::async(std::launch::async, work);
}

template <class Terminal_T> struct Accumulator {
  std::shared_ptr<atomic<int>> counter;
  std::shared_ptr<atomic<int>> accumulator;
  Terminal_T func;
  std::shared_ptr<Accumulator> parent;

  void operator()(int value) {
    *accumulator += value;
    if (--*counter == 0) {
      if (parent)
        (*parent)(*accumulator);
      else
        func(*accumulator);
    }
  }
};

template <class T>
auto make_shared_accumulator(T func, int nb_leaves,
                             std::shared_ptr<Accumulator<T>> parent = nullptr) {
  return make_shared<Accumulator<T>>(
      Accumulator<T>{make_shared<atomic<int>>(nb_leaves),
                     make_shared<atomic<int>>(0), func, parent});
}

template <class Begin_T, class End_T, class Predicate_T, class Continuation_T>
auto async_count_if(Begin_T begin, End_T end, Predicate_T predicate,
                    Continuation_T continuation) {
  auto sz = end - begin;

  // (0) Base case:
  if (sz < 64) {
    (*continuation)(std::count_if(begin, end, predicate));
    return;
  }

  // (1) Recursive case: the new accumulator carries its own counter (2)
  // and links back to the parent accumulator.
  auto cleanup = make_shared_accumulator(continuation->func, 2, continuation);
  const auto mid = std::next(begin, sz / 2);

  post_in_thread_pool([=] { async_count_if(begin, mid, predicate, cleanup); });

  async_count_if(mid, end, predicate, cleanup);
}

int main() {
  std::vector<int> v(512);
  std::iota(std::begin(v), std::end(v), 0);

  auto res_func = [](int res) { std::cout << res << std::endl; };
  async_count_if(std::begin(v), std::end(v),
                 /*    predicate */ [](auto x) { return x < 256; },
                 /* continuation */
                 make_shared_accumulator(res_func, 1));
}

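On Coliru. It is not perfect: many useless copies could be avoided by using reference wrappers (and other optimizations are probably possible), but I tried to keep the example explanatory rather than optimized.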

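The problem is that it is not easy to adapt this to more complex dataflows that use accumulators of several different types, which I think is your real case.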

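You are trying to implement a parallelized data computing pipeline. This is not a simple problem that can be solved with syntax tricks. You need a thread-safe way for your tasks to communicate that neither recurses nor blocks threads.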

The standard library alone cannot help you enough. The best you could do is a shaky implementation based on futures.

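To get out of this trap, you need more tools. You could consider using TensorFlow to implement your computation model. You could also use an experimental framework such as Boson or RaftLib (multithreading is not implemented yet in this one). Or implement your own, but be aware that getting it right is a lot of work.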

Avoiding recursive template instantiation overflow in parallel recursive asynchronous algorithms

3 answers: