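I am optimizing a library with OpenMP. I benchmarked the library on two different platforms, an x86 workstation and an 8-core phone (Honor 5c).
To run the code on the phone, I simply cross-compile everything on the workstation and drive the benchmark with a script that uses adb. However, I am running into trouble because not everything is optimized the way I want, i.e. close to the theoretical 8x speedup on the phone. The explanation seems to lie in the CPU usage during a simple matrix multiplication. I have this basic code that helps me measure the usage: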
#include <cstdio>
#include <cstdlib>
// For custom types
#include "smu/core.h"
int main(void) {
    long double cpua[4], cpub[4], loadavg;
    FILE *fp;
    char dump[50];
    // Setting matrices
    int32 nr = 500;
    int32 nc = 500;
    float32 *a = (float32*)malloc(nr * nc * sizeof(float32));
    float32 *b = (float32*)malloc(nr * nc * sizeof(float32));
    float32 *c = (float32*)malloc(nr * nc * sizeof(float32));
    for (int32 i = 0; i < nr; ++i) {
        float32 *adata = a + i * nc;
        float32 *bdata = b + i * nc;
        int32 cache_nc = nc;
        for (int32 j = 0; j < cache_nc; ++j) {
            adata[j] = (float32)rand() / (float32)RAND_MAX * 100.;
            bdata[j] = (float32)rand() / (float32)RAND_MAX * 100. - 50.;
        }
    }
    for(;;) {
        fp = fopen("/proc/stat", "r");
        fscanf(fp,"%*s %Lf %Lf %Lf %Lf", &cpua[0], &cpua[1], &cpua[2], &cpua[3]);
        fclose(fp);
        for (int32 i = 0; i < nr ; ++i) {
            int32 cache_nc = nc;
            float32 *adata = a + i * cache_nc;
            float32 *cdata = c + i * cache_nc;
            for (int32 j = 0; j < cache_nc; ++j) {
                cdata[j] = 0.;
                for (int32 k = 0; k < cache_nc; ++k)
                    cdata[j] += adata[k] * b[k * cache_nc + j];
            }
        }
        fp = fopen("/proc/stat", "r");
        fscanf(fp,"%*s %Lf %Lf %Lf %Lf", &cpub[0], &cpub[1], &cpub[2], &cpub[3]);
        fclose(fp);
        loadavg = ((cpub[0] + cpub[1] + cpub[2]) - (cpua[0] + cpua[1] + cpua[2])) /
                  ((cpub[0] + cpub[1] + cpub[2] + cpub[3]) - (cpua[0] + cpua[1] + cpua[2] + cpua[3]));
        printf("CPU usage : %Lf\n", loadavg);
        fp = fopen("/proc/stat", "r");
        fscanf(fp,"%*s %Lf %Lf %Lf %Lf", &cpua[0], &cpua[1], &cpua[2], &cpua[3]);
        fclose(fp);
        #pragma omp parallel for num_threads(8) schedule(dynamic, 1)
        for (int32 i = 0; i < nr ; ++i) {
            int32 cache_nc = nc;
            float32 *adata = a + i * cache_nc;
            float32 *cdata = c + i * cache_nc;
            for (int32 j = 0; j < cache_nc; ++j) {
                cdata[j] = 0.;
                for (int32 k = 0; k < cache_nc; ++k)
                    cdata[j] += adata[k] * b[k * cache_nc + j];
            }
        }
        fp = fopen("/proc/stat", "r");
        fscanf(fp,"%*s %Lf %Lf %Lf %Lf", &cpub[0], &cpub[1], &cpub[2], &cpub[3]);
        fclose(fp);
        loadavg = ((cpub[0] + cpub[1] + cpub[2]) - (cpua[0] + cpua[1] + cpua[2])) /
                  ((cpub[0] + cpub[1] + cpub[2] + cpub[3]) - (cpua[0] + cpua[1] + cpua[2] + cpua[3]));
        printf("CPU usage with OpenMP : %Lf\n", loadavg);
    }
    free(a);
    free(b);
    free(c);
    return(0);
}
On my x86 workstation, the results are as expected:
CPU usage : 0.267606
CPU usage with OpenMP : 1.000000
CPU usage : 0.271429
CPU usage with OpenMP : 1.000000
On the phone, however, it seems unable to use all the cores at once:
CPU usage : 0.143388
CPU usage with OpenMP : 0.495968
CPU usage : 0.129955
CPU usage with OpenMP : 0.496626
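This is strange, because the usage without OpenMP looks right to me, given that only 1 of the 8 cores is in use. I checked the platform information reported by OpenMP, and it correctly sees 8 cores on the Honor 5c.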
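For reference, the figure printed by the benchmark is the change in (user + nice + system) divided by the change in (user + nice + system + idle), both read from the first line of /proc/stat. Under that formula, one fully busy core out of 8 gives 1/8 = 0.125, and four fully busy cores give 4/8 = 0.5. Below is a minimal sketch of the same computation factored into a helper. The names `CpuSample`, `parse_cpu_line` and `usage_between` are hypothetical and not part of the library being benchmarked.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper type: the first four counters of the aggregate
// "cpu" line in /proc/stat, in the order the benchmark reads them.
struct CpuSample {
    unsigned long long user = 0, nice = 0, system = 0, idle = 0;
};

// Parse one line such as "cpu  100 0 50 850 ...".
// Returns false if fewer than four counters could be read.
bool parse_cpu_line(const char *line, CpuSample &out) {
    return std::sscanf(line, "%*s %llu %llu %llu %llu",
                       &out.user, &out.nice, &out.system, &out.idle) == 4;
}

// Fraction of the elapsed ticks spent busy between two samples
// (a taken first, b taken later). Returns 0 if no ticks elapsed.
double usage_between(const CpuSample &a, const CpuSample &b) {
    const unsigned long long busy_a = a.user + a.nice + a.system;
    const unsigned long long busy_b = b.user + b.nice + b.system;
    assert(busy_b >= busy_a && b.idle >= a.idle);  // counters are cumulative
    const unsigned long long busy = busy_b - busy_a;
    const unsigned long long total = busy + (b.idle - a.idle);
    return total == 0 ? 0.0
                      : static_cast<double>(busy) / static_cast<double>(total);
}
```

Integer counters are used here instead of `long double` so that the subtraction is exact. The result is the same ratio the benchmark prints.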
My questions are:
Edit:
I tried to see how the OS itself handles the cores by running this simple script directly on the device:
#!/system/bin/sh
i=0
while : ; do
    i=$(($i + 1))
done
Even with 8 copies of it running at the same time, CPU usage tops out at 50%.
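One way to see which cores are actually busy during such a test is to look at the per-core `cpuN` lines of /proc/stat instead of the aggregate line. The following is a rough sketch under that assumption. `CoreTimes`, `parse_per_core` and `per_core_usage` are hypothetical names, and the functions take the file contents as a string so they can be tried on any captured snapshot. On Linux, a core that has been taken offline typically has no `cpuN` line at all, so a missing index is itself a hint.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <map>
#include <sstream>
#include <string>

// Hypothetical record: cumulative busy and idle ticks for one core.
struct CoreTimes {
    unsigned long long busy;
    unsigned long long idle;
};

// Collect the per-core "cpuN ..." lines from the text of /proc/stat.
// The aggregate "cpu ..." line is skipped on purpose.
std::map<int, CoreTimes> parse_per_core(const std::string &stat_text) {
    std::map<int, CoreTimes> cores;
    std::istringstream in(stat_text);
    std::string line;
    while (std::getline(in, line)) {
        // Require "cpu" immediately followed by a digit; otherwise %d would
        // skip whitespace and misread the aggregate line as a core index.
        if (line.size() < 4 || line.compare(0, 3, "cpu") != 0 ||
            !std::isdigit(static_cast<unsigned char>(line[3]))) {
            continue;
        }
        int id = -1;
        unsigned long long user = 0, nice = 0, sys = 0, idle = 0;
        if (std::sscanf(line.c_str(), "cpu%d %llu %llu %llu %llu",
                        &id, &user, &nice, &sys, &idle) == 5) {
            CoreTimes t;
            t.busy = user + nice + sys;
            t.idle = idle;
            cores[id] = t;
        }
    }
    return cores;
}

// Busy fraction per core between two snapshots. Cores that are missing
// from either snapshot are left out of the result.
std::map<int, double> per_core_usage(const std::map<int, CoreTimes> &before,
                                     const std::map<int, CoreTimes> &after) {
    std::map<int, double> usage;
    for (const auto &entry : after) {
        const auto it = before.find(entry.first);
        if (it == before.end()) continue;
        assert(entry.second.busy >= it->second.busy &&
               entry.second.idle >= it->second.idle);
        const unsigned long long busy = entry.second.busy - it->second.busy;
        const unsigned long long total =
            busy + (entry.second.idle - it->second.idle);
        usage[entry.first] =
            total == 0 ? 0.0
                       : static_cast<double>(busy) / static_cast<double>(total);
    }
    return usage;
}
```

Reading /proc/stat into a string before and after the busy loop and passing both snapshots to `per_core_usage` would show whether the load sits on four cores at full usage or is spread thinly over all eight.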
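I read this article explaining that there could be several OS in a phone, with only one of them made available. In my case, that would be one per group of 4 cores. But then I don't understand why OpenMP sees 8 cores...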
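A possible way to narrow this down is to compare how many processors the system says are configured with how many it says are currently online. The sketch below uses `sysconf` with `_SC_NPROCESSORS_CONF` and `_SC_NPROCESSORS_ONLN`, which are available on Linux. `CpuCounts` and `query_cpu_counts` are hypothetical names. Which of the two numbers a given OpenMP runtime reports is an assumption to verify on the device, for example by printing `omp_get_num_procs()` alongside.

```cpp
#include <cassert>
#include <cstdio>
#include <unistd.h>

// Hypothetical record holding both processor counts reported by the OS.
struct CpuCounts {
    long configured;  // processors the system knows about
    long online;      // processors currently available for scheduling
};

// Query both counts; sysconf returns -1 if a value is not supported.
CpuCounts query_cpu_counts() {
    CpuCounts c;
    c.configured = sysconf(_SC_NPROCESSORS_CONF);
    c.online = sysconf(_SC_NPROCESSORS_ONLN);
    return c;
}

// Print the two counts so they can be compared with what OpenMP reports.
void print_cpu_counts() {
    const CpuCounts c = query_cpu_counts();
    assert(c.configured != 0 && c.online != 0);  // zero would be meaningless
    std::printf("configured: %ld, online: %ld\n", c.configured, c.online);
}
```

If the phone reported 8 configured but only 4 online while the benchmark runs, that would be consistent with the 50% ceiling, although it would not by itself explain why the other cores are unavailable.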