Question

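I am using Raspbian on a Raspberry Pi 3.

I need to split my code into a few blocks (2 or 4) and assign each block to its own thread to speed up the computation.

Right now I am testing a simple loop on one thread (see the attached code) and then on 4 threads. The execution time with 4 threads is always 4 times longer, so it looks as if the 4 threads are being scheduled on the same CPU.

How can I make each thread run on a different CPU? Even 2 threads on 2 CPUs would make a big difference for me.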
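
For reference, on Linux a thread can be restricted to a single core with the glibc extension pthread_setaffinity_np. The following is only a minimal sketch, assuming Linux with libstdc++ (where std::thread::native_handle() yields a pthread_t); the helper name pin_to_cpu is hypothetical and not part of any library.

```cpp
#include <pthread.h>  // pthread_setaffinity_np, pthread_self (glibc, Linux only)
#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET, sched_getcpu
#include <thread>

// Hypothetical helper: restrict the thread behind `handle` to one core.
// Returns 0 on success, or an error number (e.g. EINVAL) on failure.
int pin_to_cpu(pthread_t handle, unsigned cpu) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);      // start from an empty CPU mask
    CPU_SET(cpu, &cpuset);  // allow exactly one core
    return pthread_setaffinity_np(handle, sizeof(cpu_set_t), &cpuset);
}
```

Calling pin_to_cpu(t.native_handle(), i) right after creating thread i would bind it to core i, and pin_to_cpu(pthread_self(), i) does the same for the calling thread. The Linux scheduler normally spreads runnable threads across idle cores by itself, so explicit pinning is usually a tuning step rather than a requirement.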

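I also tried g++ 6, with no improvement. Using the OpenMP parallel library with "#pragma omp for" in the code still runs on one CPU.

I ran this code on Fedora Linux x86 and saw the same behavior, but on Windows 8.1 with VS2015 I got a different result: the time for one thread and for 4 threads was the same, so there it does run on different CPUs.

Do you have any suggestions?

Thank you.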

#include <iostream>
//#include <arm_neon.h>
#include <ctime>
#include <thread>
#include <mutex>
#include <vector>
using namespace std;

// Simple busy loop used as the test workload.
float simd_dot0() {
    unsigned int i;
    unsigned long rezult = 0;
    for (i = 0; i < 0xfffffff; i++) {
        rezult = i;
    }
    return rezult;
}

int main() {
    unsigned num_cpus = std::thread::hardware_concurrency();
    cout << "Hardware threads reported: " << num_cpus << endl;

    const unsigned num_threads = 4;  // the test below always starts 4 threads
    std::mutex iomutex;
    std::vector<std::thread> threads(num_threads);

    cout << "Start Test 1 CPU" << endl;
    double t_start, t_end, scan_time;
    scan_time = 0;
    t_start = clock();  // clock() returns processor time used by the process
    simd_dot0();
    t_end = clock();
    scan_time += t_end - t_start;
    std::cout << "\nExecution time on 1 CPU: "
              << 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
    cout << "Finish Test on 1 CPU" << endl;

    cout << "Start Test 4 CPU" << endl;
    scan_time = 0;
    t_start = clock();
    for (unsigned i = 0; i < num_threads; ++i) {
        threads[i] = std::thread([&iomutex, i] {
            simd_dot0();
            // Serialize output so lines from different threads do not interleave.
            std::lock_guard<std::mutex> lock(iomutex);
            // Note: i is the thread index, not a CPU id.
            std::cout << "\nExecution time on CPU: " << i << std::endl;
        });
    }

    for (auto& t : threads) {
        t.join();
    }

    t_end = clock();
    scan_time += t_end - t_start;
    std::cout << "\nExecution time on 4 CPUs: "
              << 1000.0 * scan_time / CLOCKS_PER_SEC << "ms" << std::endl;
    cout << "Finish Test on 4 CPU" << endl;
    cout << "!!!Hello World!!!" << endl;
    std::cin.get();  // keep the console open until Enter is pressed
    return 0;
}
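
One thing worth checking in the test above is the timer itself. std::clock() reports processor time consumed by the whole process, which on Linux is summed over all threads, while the Microsoft C runtime's clock() reports elapsed wall-clock time. The sketch below measures both for the same workload so the two can be compared; busy_loop, Timing and time_threads are hypothetical names, not taken from the original code.

```cpp
#include <chrono>
#include <cstdint>
#include <ctime>
#include <thread>
#include <vector>

// Busy loop; `volatile` keeps the optimizer from deleting it at -O3.
std::uint64_t busy_loop(std::uint64_t iterations) {
    volatile std::uint64_t sink = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        sink = i;
    }
    return sink;
}

struct Timing {
    double cpu_ms;   // processor time from std::clock()
    double wall_ms;  // elapsed time from std::chrono::steady_clock
};

// Run `num_threads` copies of busy_loop and measure with both clocks.
Timing time_threads(unsigned num_threads, std::uint64_t iterations) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (unsigned i = 0; i < num_threads; ++i) {
        threads.emplace_back([iterations] { busy_loop(iterations); });
    }
    for (auto& t : threads) {
        t.join();
    }

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    Timing result;
    result.cpu_ms =
        1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    result.wall_ms =
        std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    return result;
}
```

If time_threads(4, n) reports cpu_ms close to four times wall_ms, the four threads ran in parallel on separate cores; if the two values are about equal, they shared one core.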

Edit:

On the Raspberry Pi 3 with Raspbian, I used g++ 4.9 and 6 with the following flags:

-std=c++11 -ftree-vectorize -Wl,--no-as-needed -lpthread -march=armv8-a+crc -mcpu=cortex-a53 -mfpu=neon-fp-armv8 -funsafe-math-optimizations -O3

Multiple threads run on one core instead of four, depending on the OS

0 answers: