Here is the code that uses the MKL vector addition routine:
#include "mkl.h"
#include <ctime>
#include <chrono>
#include <iostream>

int main() {
    const int n = 10000000;
    int nbRuns = 1000;
    // 64-byte aligned buffers allocated through MKL
    double *a = (double *) mkl_malloc(n * sizeof(double), 64);
    double *b = (double *) mkl_malloc(n * sizeof(double), 64);
    double *c = (double *) mkl_malloc(n * sizeof(double), 64);
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i;
    }
    // First run not considered
    vdAdd(n, a, b, c); // MKL call
    // steady_clock is monotonic, so it is the safer choice for measuring intervals
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < nbRuns; i++) {
        vdAdd(n, a, b, c); // MKL call
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Time: " << elapsed.count() << " sec." << std::endl;
    // Release the buffers obtained from mkl_malloc
    mkl_free(a);
    mkl_free(b);
    mkl_free(c);
    return 0;
}
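When I increase the number of threads, one machine shows a normal speedup, but another machine shows no improvement at all, even though MKL is using multiple threads (this is visible in the system monitor). I am compiling on Linux and linking against mkl_rt (g++ -std=c++14
). Am I missing something?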
Update: It turns out that OpenMP shows the same behavior, as the following code (which does not use MKL) demonstrates:
#include <omp.h>
#include <ctime>
#include <chrono>
#include <iostream>

int main() {
    const int n = 10000000;
    int nbRuns = 1000;
    double *a = new double[n];
    double *b = new double[n];
    double *c = new double[n];
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i;
    }
    // steady_clock is monotonic, so it is the safer choice for measuring intervals
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < nbRuns; i++) {
        // Each run splits the element-wise addition across the OpenMP threads
        #pragma omp parallel for
        for (int j = 0; j < n; j++) {
            c[j] = a[j] + b[j];
        }
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Time: " << elapsed.count() << " sec." << std::endl;
    // Release the arrays allocated with new[]
    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
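To put the timings from the two machines in context, it may help to convert the elapsed time into an effective memory bandwidth. The sketch below is not part of the original test. The helper name effective_bandwidth_gbs is hypothetical, and it assumes each element costs two 8-byte loads and one 8-byte store, ignoring any extra write-allocate traffic the hardware may generate.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (an assumption, not from the original post).
// Estimates the memory traffic of c[j] = a[j] + b[j] as two loads and one
// store of a double per element, then divides by the measured time.
// Returns the result in GB/s (1 GB = 1e9 bytes).
double effective_bandwidth_gbs(std::size_t n, int runs, double seconds) {
    assert(seconds > 0.0);
    const double bytes_per_run = 3.0 * static_cast<double>(n) * sizeof(double);
    return bytes_per_run * runs / seconds / 1e9;
}
```

Calling it with n, nbRuns and elapsed.count() from the program above gives one number per thread count. If that number stops growing as threads are added, the loop is probably limited by memory traffic rather than by the cores, although this estimate alone does not prove it.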
Here is some technical information:
Machine 1 (no speedup): This is a laptop (HP)
CPU:
model name : Intel(R) Core(TM) i7 CPU Q 840 @ 1.87GHz
cpu MHz : 1199.000
cache size : 8192 KB
Memory:
Handle 0x0004, DMI type 16, 15 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 16 GB
Error Information Handle: Not Provided
Number Of Devices: 4
Handle 0x0005, DMI type 17, 27 bytes
Memory Device
Total Width: 64 bits
Data Width: 64 bits
Size: 4096 MB
Form Factor: SODIMM
Locator: Top-Slot 1(top)
Bank Locator: BANK 0
Type: DDR3
Type Detail: Synchronous
Speed: 1333 MHz
Machine 2 (normal speedup): This is a single node of a large HPC cluster
CPU:
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
cpu MHz : 2001.000
cache size : 20480 KB
Memory:
The node has 128GiB, but since I do not have root privileges, I cannot gather more info.
Compiler: gcc 6.1 (same effect with 5.3)
g++ -std=c++14 -O3 -fopenmp -o testomp testomp.cpp
Number of threads controlled by OMP_NUM_THREADS
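Since the thread count comes from OMP_NUM_THREADS, it can be worth confirming how many threads the runtime actually uses inside a parallel region. The following is a minimal sketch with a hypothetical helper name, active_thread_count. It only uses omp_get_num_threads from the OpenMP runtime and falls back to 1 when the file is compiled without -fopenmp.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper (an assumption, not from the original post).
// Returns the team size OpenMP uses for a parallel region, or 1 when the
// code is compiled without OpenMP support.
int active_thread_count() {
    int count = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        // Only one thread writes the shared variable; the implicit barrier
        // at the end of the single construct makes the value visible.
        #pragma omp single
        count = omp_get_num_threads();
    }
#endif
    return count;
}
```

Printing this value at the start of the benchmark on both machines would show whether the requested thread count is in effect. It says nothing about where the threads are placed or how fast memory is.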