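I would like my code to be auto-vectorized by the compiler, but I can't seem to get it right.
In particular, the message I get with the -ftree-vectorizer-verbose=6 option is
125: not vectorized: not suitable for gather D.32476_34 = *D.32475_33;
My question is: what does this message mean, and what do the numbers in it represent?
Below is a simple test example that produces the same message, so I assume the two issues are related.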
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices, int indices_num)
{
    for (int i = 0; i < indices_num; ++i)
    {
        int idx = indices[i] * 4;
        float r = pixels[idx + 0];
        float g = pixels[idx + 1];
        float b = pixels[idx + 2];
        float a = pixels[idx + 3] / 255.0f;
        pixels[idx + 0] = r;
        pixels[idx + 1] = g;
        pixels[idx + 2] = b;
        pixels[idx + 3] = a * 255.0f;
    }
    return;
}
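Also, while creating my example I ran into many other messages. I am not sure what they mean or why a particular construct is a problem for vectorization, so is there any guide, book, tutorial, blog, etc. that would explain these things to me?
If it matters, I'm using MinGW 4.7 32-bit with QtCreator 2.7.0.
EDIT: Conclusion:
Based on my tests and the suggestions in this post, the message is most likely related to accessing the data indirectly through an auxiliary index array. That results in a gather/scatter addressing scheme, which GCC currently cannot (or does not want to) vectorize. I was able to get vectorized code with clang++ 3.2-1.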
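One way to picture the gather/scatter pattern is to write the three phases out by hand: gather the indexed pixels into a contiguous buffer, do the arithmetic with unit stride, then scatter the results back. The sketch below is only an illustration; the function name `process_indexed_rgba` and the scratch buffer are assumptions, not part of the original code, and it assumes that every index is unique, non-negative, and in range. Whether a given compiler vectorizes the middle loop still has to be checked in its vectorizer report.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not from the original post.
// Assumes every entry of `indices` is unique, non-negative, and addresses a valid RGBA pixel.
void process_indexed_rgba(unsigned char* pixels, const int* indices, int indices_num)
{
    assert(indices_num >= 0);
    const std::size_t count = static_cast<std::size_t>(indices_num);
    std::vector<float> scratch(count * 4);

    // Phase 1 (gather): copy the selected pixels into a contiguous float buffer.
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t idx = static_cast<std::size_t>(indices[i]) * 4;
        for (std::size_t c = 0; c < 4; ++c)
            scratch[i * 4 + c] = pixels[idx + c];
    }

    // Phase 2 (compute): unit-stride arithmetic with no indirect addressing.
    const float factor[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    for (std::size_t i = 0; i < count; ++i)
        for (std::size_t c = 0; c < 4; ++c)
            scratch[i * 4 + c] = (scratch[i * 4 + c] / factor[c]) * factor[c];

    // Phase 3 (scatter): write the results back through the index array.
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t idx = static_cast<std::size_t>(indices[i]) * 4;
        for (std::size_t c = 0; c < 4; ++c)
            pixels[idx + c] = static_cast<unsigned char>(scratch[i * 4 + c]);
    }
}
```

The extra copies cost memory bandwidth, so this only pays off if the compute phase is heavy enough; for the trivial arithmetic in the test example it is probably not worth it.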
Answer 0 (score: 2)
A vectorized version of your code would conceptually look like this (using OpenCL syntax):
for (int i = 0; i < indices_num; ++i)
{
    int idx = indices[i] * 4;
    float4 factor = (float4)(1.0f, 1.0f, 1.0f, 255.0f);
    // vload4/vstore4 take an offset in units of 4 elements,
    // so pass offset 0 and the pointer to the pixel itself.
    uchar4 x1 = vload4(0, pixels + idx);   // Line A
    float4 x2 = convert_float4(x1);
    float4 x3 = x2 / factor;
    float4 x4 = x3 * factor;
    uchar4 x5 = convert_uchar4(x4);
    vstore4(x5, 0, pixels + idx);          // Line B
}
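But hang on: at Line A you load four chars (aka uint8) from an address that comes out of the index array, and at Line B you store them back the same way. Vectorized across loop iterations, that is a gather followed by a scatter, which is not a common feature on x86. The only instruction sets I know of that support it are AVX2 (Intel Haswell and later, gather only) and Xeon Phi. Unless you are compiling for one of those, this would explain why your compiler turns down this vectorization opportunity.
The compiler could of course load the 4 uint8s individually, build a vector from them, perform the required vector operations, and then store the 4 values by hand. My guess is that, without gather and scatter, loading and storing the values individually is considered too expensive compared with the actual work saved by vectorizing.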
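The "load individually, build a vector, operate, store individually" route can be sketched for a single pixel. This is a minimal sketch that assumes the GCC/Clang `vector_size` extension is available; the type name `vec4f` and the function `process_pixel_by_hand` are made up for illustration, and nothing here claims to be what GCC would emit.

```cpp
#include <cassert>

// GCC/Clang vector extension: four floats handled as one SIMD value.
typedef float vec4f __attribute__((vector_size(16)));

// Hypothetical helper: processes the single RGBA pixel that starts at pixels[idx].
void process_pixel_by_hand(unsigned char* pixels, int idx)
{
    assert(idx >= 0 && idx % 4 == 0);
    const vec4f factor = {1.0f, 1.0f, 1.0f, 255.0f};

    // Four separate scalar loads assembled into one vector (the manual "gather").
    vec4f x = {static_cast<float>(pixels[idx + 0]),
               static_cast<float>(pixels[idx + 1]),
               static_cast<float>(pixels[idx + 2]),
               static_cast<float>(pixels[idx + 3])};

    // Element-wise vector arithmetic, as in lines x3 and x4 of the OpenCL version.
    x = x / factor;
    x = x * factor;

    // Four separate scalar stores (the manual "scatter").
    for (int c = 0; c < 4; ++c)
        pixels[idx + c] = static_cast<unsigned char>(x[c]);
}
```

Only two of the operations are real vector work; the eight scalar loads and stores around them are the overhead that can make the compiler's cost model reject the transformation.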
Answer 1 (score: 1)
Try this code, which keeps the dividers (and multipliers) for the variables you want vectorized in small arrays:
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices, int indices_num)
{
    float dividerV[4]={1.0f,1.0f,1.0f,255.0f};
    float multiplierV[4]={1.0f,1.0f,1.0f,255.0f}; // choose whatever suits
    // The same array could serve for both multiplying and dividing. Separate arrays
    // may allow more pipelining, but also need more memory accesses, so pick carefully.
    for (int i = 0; i < indices_num; ++i)
    {
        int idx = indices[i] * 4;
        float r = pixels[idx + 0]/dividerV[0];
        float g = pixels[idx + 1]/dividerV[1];
        float b = pixels[idx + 2]/dividerV[2];
        float a = pixels[idx + 3]/dividerV[3];
        pixels[idx + 0] = r*multiplierV[0];
        pixels[idx + 1] = g*multiplierV[1];
        pixels[idx + 2] = b*multiplierV[2];
        pixels[idx + 3] = a*multiplierV[3];
    }
    return;
}
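Maybe this is easier to vectorize.
To deal with the unknown loop bound, try giving a direct constant instead of indices_num. This compiler is not a just-in-time compiler (such compilers may exist, but I have not heard of any outside Java), so a constant known at compile time might work.
Here: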
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{
    float dividerV[4]={1.0f,1.0f,1.0f,255.0f};
    float multiplierV[4]={1.0f,1.0f,1.0f,255.0f}; // choose whatever suits
    // The same array could serve for both multiplying and dividing. Separate arrays
    // may allow more pipelining, but also need more memory accesses, so pick carefully.
    for (int i = 0; i < 1000; ++i) // loop bound is now a compile-time constant
    {
        int idx = indices[i] * 4;
        float r = pixels[idx + 0]/dividerV[0];
        float g = pixels[idx + 1]/dividerV[1];
        float b = pixels[idx + 2]/dividerV[2];
        float a = pixels[idx + 3]/dividerV[3];
        pixels[idx + 0] = r*multiplierV[0];
        pixels[idx + 1] = g*multiplierV[1];
        pixels[idx + 2] = b*multiplierV[2];
        pixels[idx + 3] = a*multiplierV[3];
    }
    return;
}
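If hard-coding 1000 is too rigid, the trip count can stay a compile-time constant by making it a template parameter. The following is a minimal sketch under that assumption; the name `scale_indexed_pixels` is hypothetical. It does not remove the indirect access through `indices`, so it may well leave the "not suitable for gather" message unchanged.

```cpp
#include <cassert>

// Hypothetical variant: the trip count N is a template parameter,
// so the compiler sees a constant loop bound at every instantiation.
template <int N>
void scale_indexed_pixels(unsigned char* __restrict__ pixels, const int* __restrict__ indices)
{
    static_assert(N > 0, "trip count must be positive");
    assert(pixels != nullptr && indices != nullptr);

    const float factor[4] = {1.0f, 1.0f, 1.0f, 255.0f};
    for (int i = 0; i < N; ++i) {
        const int idx = indices[i] * 4;
        for (int c = 0; c < 4; ++c) {
            // Divide and multiply by the per-channel factor, as in the code above.
            const float v = pixels[idx + c] / factor[c];
            pixels[idx + c] = static_cast<unsigned char>(v * factor[c]);
        }
    }
}
```

A call such as `scale_indexed_pixels<1000>(pixels, indices)` then corresponds to the hard-coded loop above, and other sizes only need a different template argument.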
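Sometimes an array is not aligned properly for vector instructions. For example, the CPU may only deliver better read/write performance for arrays aligned to 32 bytes (or 16 bytes). Unaligned reads and writes are slower (or cannot be vectorized).
Here: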
#include <cstddef> // size_t
#include <cstdio>  // printf

static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{
    float dividerV[4]={1.0f,1.0f,1.0f,255.0f};
    float multiplierV[4]={1.0f,1.0f,1.0f,255.0f}; // choose whatever suits
    // Check whether the pixel buffer starts on a 32-byte boundary.
    if (reinterpret_cast<size_t>(pixels) % 32 != 0)
    {
        printf("array is not aligned! need to shift array or need to do serial calc. until aligned offset reached!\n");
        // Do the non-vectorized calculation here. Once an aligned offset is reached, switch to the vectorizing code.
    }
    else
    {
        printf("array is aligned! Starting fast access.\n");
        for (int i = 0; i < 1000; ++i)
        {
            int idx = indices[i] * 4;
            float r = pixels[idx + 0]/dividerV[0];
            float g = pixels[idx + 1]/dividerV[1];
            float b = pixels[idx + 2]/dividerV[2];
            float a = pixels[idx + 3]/dividerV[3];
            pixels[idx + 0] = r*multiplierV[0];
            pixels[idx + 1] = g*multiplierV[1];
            pixels[idx + 2] = b*multiplierV[2];
            pixels[idx + 3] = a*multiplierV[3];
        }
    }
    return;
}
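The alignment test, and the "do serial work until an aligned offset is reached" step mentioned in the comment, can be factored into two small helpers. This is a sketch only: `is_aligned` and `bytes_until_aligned` are hypothetical names, and the code uses `std::uintptr_t` for the pointer-to-integer conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if pointer p sits on a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Number of leading bytes to handle serially before p reaches the next aligned address.
std::size_t bytes_until_aligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0);
    const std::uintptr_t rem = reinterpret_cast<std::uintptr_t>(p) % alignment;
    return rem == 0 ? 0 : static_cast<std::size_t>(alignment - rem);
}
```

If you own the buffer, declaring it with `alignas(32)` avoids the serial prologue altogether. Note that with indexed access each pixel address is `pixels + idx`, so an aligned base pointer alone does not make the individual accesses 32-byte aligned.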
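Maybe someone could open memcpy or some other array-copy asm file, inject some multiplication code into it, and compile it as memcpy_with_multiplication(,,,)?
My last suggestion: wrap r, g, b, a in a single array so that they sit at contiguous addresses. Here: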
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{
    float dividerV[4]={1.0f,1.0f,1.0f,255.0f};
    float multiplierV[4]={1.0f,1.0f,1.0f,255.0f}; // choose whatever suits
    // The same array could serve for both multiplying and dividing. Separate arrays
    // may allow more pipelining, but also need more memory accesses, so pick carefully.
    for (int i = 0; i < 1000; ++i)
    {
        int idx = indices[i] * 4;
        float rgba[4]; // the four channels now live at contiguous addresses
        rgba[0] = pixels[idx + 0]/dividerV[0];
        rgba[1] = pixels[idx + 1]/dividerV[1];
        rgba[2] = pixels[idx + 2]/dividerV[2];
        rgba[3] = pixels[idx + 3]/dividerV[3];
        pixels[idx + 0] = rgba[0]*multiplierV[0];
        pixels[idx + 1] = rgba[1]*multiplierV[1];
        pixels[idx + 2] = rgba[2]*multiplierV[2];
        pixels[idx + 3] = rgba[3]*multiplierV[3];
    }
    return;
}
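"indices[i]" is not a clear addressing argument. That may be the problem. Try some other way of showing the access pattern to the compiler. What happens when you use just i instead of indices[i]? Does it compile the same way? The value of indices[i] cannot be known at compile time, or it is too complex for the compiler.
Simpler (and also wrong, because it ignores indices) and more vectorizable: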
static void not_suitable_for_gather(unsigned char * __restrict__ pixels, int * __restrict__ indices)
{
    float dividerV[4]={1.0f,1.0f,1.0f,255.0f};
    float multiplierV[4]={1.0f,1.0f,1.0f,255.0f}; // choose whatever suits
    // You need a sorted version of the indices[] (or pixels[]) array to achieve something like this.
    for (int i = 0; i < 4000; i+=4)
    {
        float rgba[4];
        rgba[0] = pixels[i + 0]/dividerV[0];
        rgba[1] = pixels[i + 1]/dividerV[1];
        rgba[2] = pixels[i + 2]/dividerV[2];
        rgba[3] = pixels[i + 3]/dividerV[3];
        pixels[i + 0] = rgba[0]*multiplierV[0];
        pixels[i + 1] = rgba[1]*multiplierV[1];
        pixels[i + 2] = rgba[2]*multiplierV[2];
        pixels[i + 3] = rgba[3]*multiplierV[3];
    }
    return;
}
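If the index array can be sorted, one way to keep the result correct while still getting contiguous inner loops is to split it into runs of consecutive indices and process each run with unit stride. The sketch below assumes `indices` is sorted in ascending order with no duplicates; the name `process_sorted_runs` is hypothetical, and how much this helps depends entirely on how long the runs are in real data.

```cpp
#include <cassert>

// Hypothetical helper: assumes `indices` is sorted ascending, without duplicates,
// and that every index addresses a valid RGBA pixel.
void process_sorted_runs(unsigned char* pixels, const int* indices, int indices_num)
{
    assert(indices_num >= 0);
    const float factor[4] = {1.0f, 1.0f, 1.0f, 255.0f};

    int i = 0;
    while (i < indices_num) {
        // Extend the run while the next index is exactly one past the previous one.
        int run_end = i + 1;
        while (run_end < indices_num && indices[run_end] == indices[run_end - 1] + 1)
            ++run_end;

        // The run covers one contiguous byte range, addressed without indices[].
        const int first = indices[i] * 4;
        const int last = first + (run_end - i) * 4;
        for (int k = first; k < last; k += 4) {
            for (int c = 0; c < 4; ++c) {
                const float v = pixels[k + c] / factor[c];
                pixels[k + c] = static_cast<unsigned char>(v * factor[c]);
            }
        }
        i = run_end;
    }
}
```

For indices such as {1, 2, 3, 7} this processes pixels 1 to 3 in one contiguous loop and pixel 7 in a second, single-pixel run.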