I don't understand why this code is not vectorized by gcc 4.4.6:
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + pfTab[iIndex];
}
note: not vectorized: unhandled data-ref
However, if I write the following code:
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}
gcc successfully auto-vectorizes the loop.
If I add an OpenMP directive:
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
#pragma omp parallel for
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}
I get the following note: not vectorized: unhandled data-ref
Could you help me understand why the first and the third versions are not auto-vectorized?
Second question: math functions (exp, log, etc.) do not seem to be vectorized. This code, for example:
for (int i = 0; i < iSize; i++)
    pfResult[i] = exp(pfResult[i]);
is not vectorized. Is that due to my gcc version?
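Edit: with a newer gcc (4.8.1) and OpenMP 3.1 (_OPENMP 201107, as reported by echo | cpp -fopenmp -dM | grep -i open), I get the following notes for every kind of loop, even a basic one such as:
for (iGID = 0; iGID < iSize; iGID++)
{
    pfResult[iGID] = fValue;
}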
note: not consecutive access *_144 = 5.0e-1;
note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.
EDIT2:
#include <stdio.h>
#include <sys/time.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include <unistd.h>   /* getpagesize() */
#include <omp.h>
int main()
{
    int szGlobalWorkSize = 131072;
    int iGID = 0;
    int j = 0;
    omp_set_dynamic(0);
    // warmup
#if WARMUP
#pragma omp parallel
    {
#pragma omp master
        {
            printf("%d threads\n", omp_get_num_threads());
        }
    }
#endif
    printf("Pagesize=%d\n", getpagesize());
    float *pfResult = (float *)malloc(szGlobalWorkSize * 100 * sizeof(float));
    float fValue = 0.5f;
    struct timeval tim;
    gettimeofday(&tim, NULL);
    double tLaunch1 = tim.tv_sec + (tim.tv_usec / 1000000.0);
    double time = omp_get_wtime();
    int iChunk = getpagesize();
    int iSize = ((int)szGlobalWorkSize * 100) / iChunk;
    //#pragma omp parallel for
    for (iGID = 0; iGID < iSize; iGID++)
    {
        pfResult[iGID] = fValue;
    }
    time = omp_get_wtime() - time;
    gettimeofday(&tim, NULL);
    double tLaunch2 = tim.tv_sec + (tim.tv_usec / 1000000.0);
    printf("%.6lf Time1\n", tLaunch2 - tLaunch1);
    printf("%.6lf Time2\n", time);
}
Results:
#define _OPENMP 201107
gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
gcc -march=native -fopenmp -O3 -ftree-vectorizer-verbose=2 test.c -lm
Lots of
note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.
and note: not consecutive access *_144 = 5.0e-1;
Thanks
Answer 0 (score: 7)
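GCC cannot vectorise the first version of your loop because it cannot prove that pfTab[iIndex] is not located somewhere within the memory spanned by pfResult[0] ... pfResult[iSize-1] (pointer aliasing). Indeed, if pfTab[iIndex] lies somewhere in that memory, its value must be overwritten by the assignment in the loop body, and the new value has to be used in the iterations that follow. You should use the restrict keyword to hint to the compiler that this can never happen, and then it should happily vectorise your code: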
$ cat foo.c
int MyFunc(const float *restrict pfTab, float *restrict pfResult,
           int iSize, int iIndex)
{
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + pfTab[iIndex];
}
$ gcc -v
...
gcc version 4.6.1 (GCC)
$ gcc -std=c99 -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
foo.c:3: note: LOOP VECTORIZED.
foo.c:1: note: vectorized 1 loops in function.
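The second version vectorises because the value is copied into a variable with automatic storage duration. The general assumption here is that pfResult does not span the stack memory where fTab is stored (a cursory read of the C99 language specification does not make it clear whether that assumption is merely a weak one or whether something in the standard permits it).
The OpenMP version does not vectorise because of the way OpenMP is implemented in GCC: it outlines the code of parallel regions into separate functions.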
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
#pragma omp parallel for
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}
effectively becomes:
struct omp_data_s
{
    float *pfResult;
    int iSize;
    float fTab;   // shared scalar, carried by value
};
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
    struct omp_data_s omp_data_o;
    // copy the shared variables into the data block
    omp_data_o.pfResult = pfResult;
    omp_data_o.iSize = iSize;
    omp_data_o.fTab = fTab;
    GOMP_parallel_start (MyFunc_omp_fn0, &omp_data_o, 0);
    MyFunc_omp_fn0 (&omp_data_o);
    GOMP_parallel_end ();
    // copy the shared variables back after the parallel region
    pfResult = omp_data_o.pfResult;
    iSize = omp_data_o.iSize;
    fTab = omp_data_o.fTab;
}
void MyFunc_omp_fn0 (struct omp_data_s *omp_data_i)
{
    int start = ...; // compute starting iteration for current thread
    int end = ...;   // compute ending iteration for current thread
    for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}
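MyFunc_omp_fn0 contains the outlined function code. The compiler cannot prove that omp_data_i->pfResult does not point to memory that aliases omp_data_i, and specifically its member fTab.
To get that loop vectorised, you have to make fTab firstprivate. This turns it into an automatic variable in the outlined code, which is equivalent to your second case: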
$ cat foo.c
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
#pragma omp parallel for firstprivate(fTab)
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}
$ gcc -std=c99 -fopenmp -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
foo.c:6: note: LOOP VECTORIZED.
foo.c:4: note: vectorized 1 loops in function.