Question

Consider the following C++ code:

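#include "threadpool.hpp"
#include <chrono>
#include <cmath>
#include <cstdlib>     // std::atoi, exit
#include <functional>  // std::bind
#include <iostream>
#include <list>
#include <string>      // std::stoi
#include <thread>      // std::thread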

int loop_size;

void process(int num) {
    double x = 0;
    double sum = 0;
    for(int i = 0; i < loop_size; ++i) {
        x += 0.0001;
        sum += sin(x) / cos(x) + cos(x) * cos(x);
    }
}

int main(int argc, char* argv[]) {
    // argv[3] is read below, so all three arguments are required
    if(argc < 4) {
        std::cerr << argv[0] << " [thread_pool_size] [threads] [iterations]" << std::endl;
        return 1;
    }
    thread_pool* pool = nullptr;
    int th_count = std::atoi(argv[1]);
    if(th_count != 0) {
        pool = new thread_pool(th_count);
    }
    loop_size = std::stoi(argv[3]);
    int max = std::stoi(argv[2]);
    auto then = std::chrono::steady_clock::now();
    std::list<std::thread> ths;
    if(th_count == 0) {
        for(int i = 0; i < max; ++i) {
            ths.emplace_back(&process, i);
        }
        for(std::thread& t : ths) {
            t.join();
        }
    } else {
        for(int i = 0; i < max; ++i) {
            pool->enqueue(std::bind(&process, i));
        }
        delete pool;
    }
    int diff = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - then).count();
    std::cerr << "Time: " << diff << '\n';
    return 0;
}

"threadpool.hpp" is a modified version of this GitHub repo, and it is available here.

I compiled the code above on my machine (Core i7-6700) and on an 88-core server (2x Xeon E5-2696 v4). I cannot explain the results.

This is how I run the code:

tp <threadpool size> <number of threads> <iterations>

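The same code runs slower on the faster machine! I have 8 cores on my local machine and 88 cores on the remote server, and these are the results (the last two columns show the average completion time on each machine, in milliseconds):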

+============+=========+============+=============+====================+
| Threadpool | Threads | Iterations | Corei7-6700 | 2x Xeon E5-2696 v4 |
+============+=========+============+=============+====================+
|        100 |  100000 |       1000 |        1300 |               6000 |
+------------+---------+------------+-------------+--------------------+
|       1000 |  100000 |       1000 |        1400 |               5000 |
+------------+---------+------------+-------------+--------------------+
|      10000 |  100000 |       1000 |        1470 |               3400 |
+------------+---------+------------+-------------+--------------------+

It seems that having more cores makes the code run slower. So I used taskset to restrict the CPU affinity on the server to 8 cores and ran the code again:

taskset -c 0-7 tp <threadpool size> <number of threads> <iterations>

Here is the new data:

+============+=========+============+=============+====================+
| Threadpool | Threads | Iterations | Corei7-6700 | 2x Xeon E5-2696 v4 |
+============+=========+============+=============+====================+
|        100 |  100000 |       1000 |        1300 |                900 |
+------------+---------+------------+-------------+--------------------+
|       1000 |  100000 |       1000 |        1400 |               1000 |
+------------+---------+------------+-------------+--------------------+
|      10000 |  100000 |       1000 |        1470 |               1070 |
+------------+---------+------------+-------------+--------------------+

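I have tested the same code on a 32-core Xeon and on an older 22-core Xeon machine, and the pattern is similar: with fewer cores, the multithreaded code runs faster. But why?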

Important note: this is an attempt to solve my original problem, described here:

Why having more and faster cores makes my multithreaded software slower?

Notes:

The OS and compiler are the same on all machines: Debian 9.0 amd64 running kernel 4.0.9-3, GCC 6.3.0 20170516.
No additional flags, default optimization: g++ ./threadpool.cpp -o ./tp -lpthread

Answer 1

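In general, for CPU-bound code like this, you should not expect any benefit from running more threads in the pool than you have cores to execute them.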

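For example, it could be interesting to compare pools of 1, 2, ... N/2 ... N ... N*2 threads on an N-core socket. A pool with 10 * N threads is really just testing how the scheduler behaves under load.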

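Then, in general, you also need to understand the per-task overhead: the more tasks you split your work into, the more time is spent creating, destroying and synchronizing access to those tasks. Varying the size of the sub-tasks for a fixed amount of work is a good way to measure this.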

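Finally, it helps to know the physical architecture you are using. A NUMA server platform can do twice as much work with its two sockets as the same single CPU can do alone, provided each socket only accesses its own directly attached memory. As soon as you transfer data across QPI, performance degrades. Bouncing a heavily contended cache line, like the one holding your mutex, across QPI can slow the whole thing down.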

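Similarly, if you have N cores and want to run N threads in the pool, do you know whether they are physical cores or hyperthreaded logical cores? If they are HT, do you know whether your threads will be able to run at full speed, or whether they will compete for limited shared resources?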

Answer 2

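You are enqueuing a very large number of workers into the thread pool, and each one takes very little time to execute. As a result, you are bottlenecked by the implementation of the thread pool (not by the actual work), specifically by the way its mutex handles contention. I tried replacing thread_pool with folly::CPUThreadPoolExecutor, which helped somewhat: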

thread_pool version:
2180 ms | thread_pool_size=100   num_workers=100000 loop_size=1000 affinity=0-23
2270 ms | thread_pool_size=1000  num_workers=100000 loop_size=1000 affinity=0-23
2400 ms | thread_pool_size=10000 num_workers=100000 loop_size=1000 affinity=0-23
 530 ms | thread_pool_size=100   num_workers=100000 loop_size=1000 affinity=0-7
1930 ms | thread_pool_size=1000  num_workers=100000 loop_size=1000 affinity=0-7
2300 ms | thread_pool_size=10000 num_workers=100000 loop_size=1000 affinity=0-7
folly::CPUThreadPoolExecutor version:
 830 ms | thread_pool_size=100   num_workers=100000 loop_size=1000 affinity=0-23
 780 ms | thread_pool_size=1000  num_workers=100000 loop_size=1000 affinity=0-23
 800 ms | thread_pool_size=10000 num_workers=100000 loop_size=1000 affinity=0-23
 880 ms | thread_pool_size=100   num_workers=100000 loop_size=1000 affinity=0-7
1130 ms | thread_pool_size=1000  num_workers=100000 loop_size=1000 affinity=0-7
1120 ms | thread_pool_size=10000 num_workers=100000 loop_size=1000 affinity=0-7

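I suggest that you (1) do more work in each thread; (2) use about as many threads as CPUs; (3) use a better thread pool. Let's set thread_pool_size to the number of CPUs and multiply loop_size by 10: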

thread_pool version:
1880 ms | thread_pool_size=24 num_workers=100000 loop_size=10000 affinity=0-23
4100 ms | thread_pool_size=8  num_workers=100000 loop_size=10000 affinity=0-7
folly::CPUThreadPoolExecutor version:
1520 ms | thread_pool_size=24 num_workers=100000 loop_size=10000 affinity=0-23
2310 ms | thread_pool_size=8  num_workers=100000 loop_size=10000 affinity=0-7

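Notice that by increasing the amount of work per thread by 10x, we actually made the thread_pool version faster, and the folly::CPUThreadPoolExecutor version only took 2x as long. Let's multiply loop_size by another factor of 10: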

thread_pool version:
28695 ms | thread_pool_size=24 num_workers=100000 loop_size=100000 affinity=0-23
81600 ms | thread_pool_size=8  num_workers=100000 loop_size=100000 affinity=0-7
folly::CPUThreadPoolExecutor version:
 6830 ms | thread_pool_size=24 num_workers=100000 loop_size=100000 affinity=0-23
14400 ms | thread_pool_size=8  num_workers=100000 loop_size=100000 affinity=0-7

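For folly::CPUThreadPoolExecutor the results speak for themselves: doing more work in each thread gets you closer to truly linear gains from parallelism. thread_pool does not seem up to the task; it cannot properly cope with mutex contention at this scale.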

Here is the code I used for testing (compiled with gcc 5.5, full optimization):

#include <chrono>
#include <cmath>
#include <iostream>
#include <memory>
#include <vector>

#define USE_FOLLY 1

#if USE_FOLLY
#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/futures/Future.h>
#else
#include "threadpool.hpp"
#endif

int loop_size;
thread_local double dummy = 0.0;

void process(int num) {
  double x = 0;
  double sum = 0;
  for (int i = 0; i < loop_size; ++i) {
    x += 0.0001;
    sum += sin(x) / cos(x) + cos(x) * cos(x);
  }
  dummy += sum; // prevent optimization
}

int main(int argc, char* argv[]) {
  // argv[3] is read below, so all three arguments are required
  if (argc < 4) {
    std::cerr << argv[0] << " [thread_pool_size] [threads] [iterations]"
              << std::endl;
    return 1;
  }
  int th_count = std::atoi(argv[1]);
#if USE_FOLLY
  auto executor = std::make_unique<folly::CPUThreadPoolExecutor>(th_count);
#else
  auto pool = std::make_unique<thread_pool>(th_count);
#endif
  loop_size = std::stoi(argv[3]);
  int max = std::stoi(argv[2]);

  auto then = std::chrono::steady_clock::now();
#if USE_FOLLY
  std::vector<folly::Future<folly::Unit>> futs;
  for (int i = 0; i < max; ++i) {
    futs.emplace_back(folly::via(executor.get()).then([i]() { process(i); }));
  }
  folly::collectAll(futs).get();
#else
  for (int i = 0; i < max; ++i) {
    pool->enqueue([i]() { process(i); });
  }
  pool = nullptr;
#endif

  int diff = std::chrono::duration_cast<std::chrono::milliseconds>(
                 std::chrono::steady_clock::now() - then)
                 .count();
  std::cerr << "Time: " << diff << '\n';
  return 0;
}

Why does multithreaded code run slower on a faster machine?
