GCC 5 and later with AVX2

Date: 2017-04-01 04:20:07

Tags: c++ gcc avx2

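I wrote the class "T" below to speed up manipulation of "character sets" using AVX2. Then I found that it does not work with gcc 5 and later when I use "-O3". Can anyone help me trace this to some programming construct that is known not to work on the latest compilers/systems?

How this code works: the underlying structure ("_bits") is a block of 256 bytes (aligned and allocated for AVX2) that can be accessed either as char[256] or as AVX2 elements, depending on whether a single element is accessed or the whole thing is used in vector operations. It seems like it should work fine on AVX2 platforms. No?

This is very hard to debug, because "valgrind" says it is clean and I cannot use a debugger (the problem goes away when I remove "-O3"). But I am not happy with just using the "|=" workaround, because if this code really is wrong, then I am probably making the same mistake elsewhere and breaking everything I develop!

It is worth noting that the "|" operator has the problem but "|=" does not. Could the problem be related to returning a structure from a function? But I thought returning a structure has worked since 1990 or so.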

// g++ -std=c++11 -mavx2 -O3 gcc_fail.cpp

#include "assert.h"
#include "immintrin.h" // AVX

class T {
public:
  __m256i _bits[8];
  inline bool& operator[](unsigned char c)       {return ((bool*)_bits)[c];}
  inline bool  operator[](unsigned char c) const {return ((bool*)_bits)[c];}
  inline          T()                   {}
  inline explicit T(char const*);
  inline T     operator| (T const& b) const;
  inline T &   operator|=(T const& b);
  inline bool  operator! ()           const;
};

T::T(char const* s)
{
  _bits[0] = _bits[1] = _bits[2] = _bits[3] = _mm256_set1_epi32(0);
  _bits[4] = _bits[5] = _bits[6] = _bits[7] = _mm256_set1_epi32(0);
  char c;
  while ((c = *s++))
    (*this)[c] = true;
}

T T::operator| (T const& b) const
{
  T res;
  for (int i = 0; i < 8; i++)
    res._bits[i] = _mm256_or_si256(_bits[i], b._bits[i]);


  // FIXME why does the above code fail with -O3 in new gcc?
  for (int i=0; i<256; i++)
    assert(res[i] == ((*this)[i] || b[i]));
  // gcc 4.7.0 - PASS
  // gcc 4.7.2 - PASS
  // gcc 4.8.0 - PASS
  // gcc 4.9.2 - PASS
  // gcc 5.2.0 - FAIL
  // gcc 5.3.0 - FAIL
  // gcc 5.3.1 - FAIL
  // gcc 6.1.0 - FAIL


  return res;
}

T & T::operator|=(T const& b)
{
  for (int i = 0; i < 8; i++)
    _bits[i] = _mm256_or_si256(_bits[i], b._bits[i]);
  return *this;
}

bool T::operator! () const
{
  for (int i = 0; i < 8; i++)
    if (!_mm256_testz_si256(_bits[i], _bits[i]))
      return false;
  return true;
}

int Main()
{
  T sep (" ,\t\n");
  T end ("");
  return !(sep|end);
}

int main()
{
  return Main();
}

1 answer:

Answer 0 (score: 8)

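The problem with your code is the use of bool* where you should have used unsigned char*. That cast permits GCC 5 to go ahead with pointer-aliasing optimizations.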

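Two dumps of the machine code for the function Main(), generated by GCC 4.8.5 and 5.3.1, are given for reference in the appendix at the end of this answer.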

Looking at the code:

Decompilation

After the prologue, the _bits of T sep are initialized to zero...

  _bits[0] = _bits[1] = _bits[2] = _bits[3] = _mm256_set1_epi32(0);
  _bits[4] = _bits[5] = _bits[6] = _bits[7] = _mm256_set1_epi32(0);
  40063d:       c5 fd 7f 44 24 60               vmovdqa %ymm0,0x60(%rsp)
  400643:       c5 fd 7f 44 24 40               vmovdqa %ymm0,0x40(%rsp)
  400649:       c5 fd 7f 44 24 20               vmovdqa %ymm0,0x20(%rsp)
  40064f:       c5 fd 7f 04 24                  vmovdqa %ymm0,(%rsp)
  400654:       c5 fd 7f 84 24 e0 00 00 00      vmovdqa %ymm0,0xe0(%rsp)
  40065d:       c5 fd 7f 84 24 c0 00 00 00      vmovdqa %ymm0,0xc0(%rsp)
  400666:       c5 fd 7f 84 24 a0 00 00 00      vmovdqa %ymm0,0xa0(%rsp)
  40066f:       c5 fd 7f 84 24 80 00 00 00      vmovdqa %ymm0,0x80(%rsp)

...and are then written to in the loop driven by char* s.

  char c;
  while ((c = *s++))
    (*this)[c] = true;
  400680:       48 83 c2 01                     add    $0x1,%rdx
  400684:       c6 04 04 01                     movb   $0x1,(%rsp,%rax,1)
  400688:       0f b6 42 ff                     movzbl -0x1(%rdx),%eax
  40068c:       84 c0                           test   %al,%al
  40068e:       75 f0                           jne    400680 <_Z4Mainv+0x60>

Both compilers then initialize T end to 0:

  400690:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400694:       31 c0                           xor    %eax,%eax
  400696:       c5 fd 7f 84 24 60 01 00 00      vmovdqa %ymm0,0x160(%rsp)
  40069f:       c5 fd 7f 84 24 40 01 00 00      vmovdqa %ymm0,0x140(%rsp)
  4006a8:       c5 fd 7f 84 24 20 01 00 00      vmovdqa %ymm0,0x120(%rsp)
  4006b1:       c5 fd 7f 84 24 00 01 00 00      vmovdqa %ymm0,0x100(%rsp)
  4006ba:       c5 fd 7f 84 24 e0 01 00 00      vmovdqa %ymm0,0x1e0(%rsp)
  4006c3:       c5 fd 7f 84 24 c0 01 00 00      vmovdqa %ymm0,0x1c0(%rsp)
  4006cc:       c5 fd 7f 84 24 a0 01 00 00      vmovdqa %ymm0,0x1a0(%rsp)
  4006d5:       c5 fd 7f 84 24 80 01 00 00      vmovdqa %ymm0,0x180(%rsp)

Both compilers then optimize away the _mm256_or_si256() operation, because T end is known to be 0. However, GCC 4.8.5 copies T sep into T res (which is what ORing with an all-zero operand amounts to), whereas GCC 5.3.1 initializes T res to 0. It is entitled to do so because in the operator[] method you cast a pointer of type __m256i* to bool*, and the compiler is allowed to assume that pointers of those two types do not alias. In other words, GCC 5.3.1 does not have to treat the bool stores made in the constructor loop as modifications of the __m256i elements of sep. Thus, in GCC 4.8.5 you see

  4006de:       c5 fd 6f 04 24                  vmovdqa (%rsp),%ymm0
  4006e3:       c5 fd 7f 84 24 00 02 00 00      vmovdqa %ymm0,0x200(%rsp)
  4006ec:       c5 fd 6f 44 24 20               vmovdqa 0x20(%rsp),%ymm0
  4006f2:       c5 fd 7f 84 24 20 02 00 00      vmovdqa %ymm0,0x220(%rsp)
  4006fb:       c5 fd 6f 44 24 40               vmovdqa 0x40(%rsp),%ymm0
  400701:       c5 fd 7f 84 24 40 02 00 00      vmovdqa %ymm0,0x240(%rsp)
  40070a:       c5 fd 6f 44 24 60               vmovdqa 0x60(%rsp),%ymm0
  400710:       c5 fd 7f 84 24 60 02 00 00      vmovdqa %ymm0,0x260(%rsp)
  400719:       c5 fd 6f 84 24 80 00 00 00      vmovdqa 0x80(%rsp),%ymm0
  400722:       c5 fd 7f 84 24 80 02 00 00      vmovdqa %ymm0,0x280(%rsp)
  40072b:       c5 fd 6f 84 24 a0 00 00 00      vmovdqa 0xa0(%rsp),%ymm0
  400734:       c5 fd 7f 84 24 a0 02 00 00      vmovdqa %ymm0,0x2a0(%rsp)
  40073d:       c5 fd 6f 84 24 c0 00 00 00      vmovdqa 0xc0(%rsp),%ymm0
  400746:       c5 fd 7f 84 24 c0 02 00 00      vmovdqa %ymm0,0x2c0(%rsp)
  40074f:       c5 fd 6f 84 24 e0 00 00 00      vmovdqa 0xe0(%rsp),%ymm0
  400758:       c5 fd 7f 84 24 e0 02 00 00      vmovdqa %ymm0,0x2e0(%rsp)

while in GCC 5.3.1 you see

  4006fa:       c5 fd 7f 85 f0 fe ff ff         vmovdqa %ymm0,-0x110(%rbp)
  400702:       c5 fd 7f 85 10 ff ff ff         vmovdqa %ymm0,-0xf0(%rbp)
  40070a:       c5 fd 7f 85 30 ff ff ff         vmovdqa %ymm0,-0xd0(%rbp)
  400712:       c5 fd 7f 85 50 ff ff ff         vmovdqa %ymm0,-0xb0(%rbp)
  40071a:       c5 fd 7f 85 70 ff ff ff         vmovdqa %ymm0,-0x90(%rbp)
  400722:       c5 fd 7f 45 90                  vmovdqa %ymm0,-0x70(%rbp)
  400727:       c5 fd 7f 45 b0                  vmovdqa %ymm0,-0x50(%rbp)
  40072c:       c5 fd 7f 45 d0                  vmovdqa %ymm0,-0x30(%rbp)

and the reads performed by assert() then fail.
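The store-then-read pattern that GCC 5 is free to break becomes well-defined when the byte accesses go through unsigned char*. The following is only a minimal sketch of that idea: Words is a hypothetical stand-in for the __m256i array (so the sketch builds without -mavx2), and set_byte and count_nonzero_bytes are illustrative names that do not appear in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a block of wide words (not the question's class T).
struct Words {
  std::uint64_t q[4];
};

// Write one byte of an object through unsigned char*. The aliasing rule
// explicitly permits this access, so the compiler has to assume that the
// store may modify obj.
template <class Obj>
void set_byte(Obj& obj, std::size_t i, unsigned char value) {
  assert(i < sizeof(Obj));
  reinterpret_cast<unsigned char*>(&obj)[i] = value;
}

// Read the object back byte by byte, again through unsigned char*.
template <class Obj>
std::size_t count_nonzero_bytes(const Obj& obj) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(&obj);
  std::size_t n = 0;
  for (std::size_t i = 0; i < sizeof(Obj); ++i)
    if (p[i] != 0) ++n;
  return n;
}
```

After set_byte(w, 3, 1) on a zeroed Words w, a later read of w.q[0] has to observe the store, which is the guarantee the bool* accessor in the question does not have.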

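The Standard's ruling on pointer aliasing:

ISO C++11 has the following section on aliasing, which makes clear that a variable of type __m256i cannot be accessed through a bool*, but can be accessed through a char*/unsigned char*: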

  

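§3.10 Lvalues and rvalues [basic.lval]

[...]

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:[52]

  • the dynamic type of the object,
  • a cv-qualified version of the dynamic type of the object,
  • a type similar (as defined in 4.4) to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its elements or non-static data members (including, recursively, an element or non-static data member of a subaggregate or contained union),
  • a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
  • a char or unsigned char type.

52) The intent of this list is to specify those circumstances in which an object may or may not be aliased.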
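One way to apply the last bullet is sketched below. It goes a step further than simply swapping the cast: the storage itself is declared as unsigned char, and the vector code reaches it through the load/store intrinsics. CharSet, bytes and union_is_empty are hypothetical names chosen for this sketch, not the question's code, and the scalar branch exists only so that the sketch also builds without -mavx2.

```cpp
#include <cassert>
#include <cstring>
#if defined(__AVX2__)
#include <immintrin.h>  // only needed when built with -mavx2
#endif

// Hypothetical rework of the question's class T: byte storage, vector access
// through load/store intrinsics instead of a bool* cast.
class CharSet {
public:
  alignas(32) unsigned char bytes[256];

  CharSet() { std::memset(bytes, 0, sizeof bytes); }

  explicit CharSet(const char* s) {
    std::memset(bytes, 0, sizeof bytes);
    for (; *s != '\0'; ++s)
      bytes[static_cast<unsigned char>(*s)] = 1;
  }

  bool operator[](unsigned char c) const { return bytes[c] != 0; }

  CharSet operator|(const CharSet& b) const {
    CharSet res;
#if defined(__AVX2__)
    for (int i = 0; i < 256; i += 32) {
      __m256i x = _mm256_load_si256(reinterpret_cast<const __m256i*>(bytes + i));
      __m256i y = _mm256_load_si256(reinterpret_cast<const __m256i*>(b.bytes + i));
      _mm256_store_si256(reinterpret_cast<__m256i*>(res.bytes + i),
                         _mm256_or_si256(x, y));
    }
#else
    // Scalar fallback so the sketch also builds without -mavx2.
    for (int i = 0; i < 256; ++i)
      res.bytes[i] = static_cast<unsigned char>(bytes[i] | b.bytes[i]);
#endif
    return res;
  }

  // True when no character is in the set.
  bool operator!() const {
    for (int i = 0; i < 256; ++i)
      if (bytes[i] != 0) return false;
    return true;
  }
};

// Mirrors the question's Main(): the union of a non-empty set and an empty
// set must not be empty.
inline bool union_is_empty() {
  CharSet sep(" ,\t\n");
  CharSet end("");
  CharSet res = sep | end;
  for (int i = 0; i < 256; ++i)
    assert(res[i] == (sep[i] || end[i]));  // the check that failed in the question
  return !res;
}
```

The smaller change, keeping __m256i _bits[8] and casting to unsigned char* in operator[], follows the same rule; declaring the storage as bytes just makes the permitted access the default one.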

Appendix

GCC 4.8.5:

0000000000400620 <_Z4Mainv>:
  400620:       55                              push   %rbp
  400621:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400625:       ba e5 08 40 00                  mov    $0x4008e5,%edx
  40062a:       b8 20 00 00 00                  mov    $0x20,%eax
  40062f:       48 89 e5                        mov    %rsp,%rbp
  400632:       48 83 e4 e0                     and    $0xffffffffffffffe0,%rsp
  400636:       48 81 ec 00 03 00 00            sub    $0x300,%rsp
  40063d:       c5 fd 7f 44 24 60               vmovdqa %ymm0,0x60(%rsp)
  400643:       c5 fd 7f 44 24 40               vmovdqa %ymm0,0x40(%rsp)
  400649:       c5 fd 7f 44 24 20               vmovdqa %ymm0,0x20(%rsp)
  40064f:       c5 fd 7f 04 24                  vmovdqa %ymm0,(%rsp)
  400654:       c5 fd 7f 84 24 e0 00 00 00      vmovdqa %ymm0,0xe0(%rsp)
  40065d:       c5 fd 7f 84 24 c0 00 00 00      vmovdqa %ymm0,0xc0(%rsp)
  400666:       c5 fd 7f 84 24 a0 00 00 00      vmovdqa %ymm0,0xa0(%rsp)
  40066f:       c5 fd 7f 84 24 80 00 00 00      vmovdqa %ymm0,0x80(%rsp)
  400678:       0f 1f 84 00 00 00 00 00         nopl   0x0(%rax,%rax,1)
  400680:       48 83 c2 01                     add    $0x1,%rdx
  400684:       c6 04 04 01                     movb   $0x1,(%rsp,%rax,1)
  400688:       0f b6 42 ff                     movzbl -0x1(%rdx),%eax
  40068c:       84 c0                           test   %al,%al
  40068e:       75 f0                           jne    400680 <_Z4Mainv+0x60>
  400690:       c5 f9 ef c0                     vpxor  %xmm0,%xmm0,%xmm0
  400694:       31 c0                           xor    %eax,%eax
  400696:       c5 fd 7f 84 24 60 01 00 00      vmovdqa %ymm0,0x160(%rsp)
  40069f:       c5 fd 7f 84 24 40 01 00 00      vmovdqa %ymm0,0x140(%rsp)
  4006a8:       c5 fd 7f 84 24 20 01 00 00      vmovdqa %ymm0,0x120(%rsp)
  4006b1:       c5 fd 7f 84 24 00 01 00 00      vmovdqa %ymm0,0x100(%rsp)
  4006ba:       c5 fd 7f 84 24 e0 01 00 00      vmovdqa %ymm0,0x1e0(%rsp)
  4006c3:       c5 fd 7f 84 24 c0 01 00 00      vmovdqa %ymm0,0x1c0(%rsp)
  4006cc:       c5 fd 7f 84 24 a0 01 00 00      vmovdqa %ymm0,0x1a0(%rsp)
  4006d5:       c5 fd 7f 84 24 80 01 00 00      vmovdqa %ymm0,0x180(%rsp)
  4006de:       c5 fd 6f 04 24                  vmovdqa (%rsp),%ymm0
  4006e3:       c5 fd 7f 84 24 00 02 00 00      vmovdqa %ymm0,0x200(%rsp)
  4006ec:       c5 fd 6f 44 24 20               vmovdqa 0x20(%rsp),%ymm0
  4006f2:       c5 fd 7f 84 24 20 02 00 00      vmovdqa %ymm0,0x220(%rsp)
  4006fb:       c5 fd 6f 44 24 40               vmovdqa 0x40(%rsp),%ymm0
  400701:       c5 fd 7f 84 24 40 02 00 00      vmovdqa %ymm0,0x240(%rsp)
  40070a:       c5 fd 6f 44 24 60               vmovdqa 0x60(%rsp),%ymm0
  400710:       c5 fd 7f 84 24 60 02 00 00      vmovdqa %ymm0,0x260(%rsp)
  400719:       c5 fd 6f 84 24 80 00 00 00      vmovdqa 0x80(%rsp),%ymm0
  400722:       c5 fd 7f 84 24 80 02 00 00      vmovdqa %ymm0,0x280(%rsp)
  40072b:       c5 fd 6f 84 24 a0 00 00 00      vmovdqa 0xa0(%rsp),%ymm0
  400734:       c5 fd 7f 84 24 a0 02 00 00      vmovdqa %ymm0,0x2a0(%rsp)
  40073d:       c5 fd 6f 84 24 c0 00 00 00      vmovdqa 0xc0(%rsp),%ymm0
  400746:       c5 fd 7f 84 24 c0 02 00 00      vmovdqa %ymm0,0x2c0(%rsp)
  40074f:       c5 fd 6f 84 24 e0 00 00 00      vmovdqa 0xe0(%rsp),%ymm0
  400758:       c5 fd 7f 84 24 e0 02 00 00      vmovdqa %ymm0,0x2e0(%rsp)
  400761:       0f 1f 80 00 00 00 00            nopl   0x0(%rax)
  400768:       80 3c 04 00                     cmpb   $0x0,(%rsp,%rax,1)
  40076c:       0f b6 8c 04 00 02 00 00         movzbl 0x200(%rsp,%rax,1),%ecx
  400774:       ba 01 00 00 00                  mov    $0x1,%edx
  400779:       75 08                           jne    400783 <_Z4Mainv+0x163>
  40077b:       0f b6 94 04 00 01 00 00         movzbl 0x100(%rsp,%rax,1),%edx
  400783:       38 d1                           cmp    %dl,%cl
  400785:       0f 85 b2 00 00 00               jne    40083d <_Z4Mainv+0x21d>
  40078b:       48 83 c0 01                     add    $0x1,%rax
  40078f:       48 3d 00 01 00 00               cmp    $0x100,%rax
  400795:       75 d1                           jne    400768 <_Z4Mainv+0x148>
  400797:       c5 fd 6f 8c 24 00 02 00 00      vmovdqa 0x200(%rsp),%ymm1
  4007a0:       31 c0                           xor    %eax,%eax
  4007a2:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007a7:       0f 94 c0                        sete   %al
  4007aa:       0f 85 88 00 00 00               jne    400838 <_Z4Mainv+0x218>
  4007b0:       c5 fd 6f 8c 24 20 02 00 00      vmovdqa 0x220(%rsp),%ymm1
  4007b9:       31 c0                           xor    %eax,%eax
  4007bb:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007c0:       0f 94 c0                        sete   %al
  4007c3:       75 73                           jne    400838 <_Z4Mainv+0x218>
  4007c5:       c5 fd 6f 8c 24 40 02 00 00      vmovdqa 0x240(%rsp),%ymm1
  4007ce:       31 c0                           xor    %eax,%eax
  4007d0:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007d5:       0f 94 c0                        sete   %al
  4007d8:       75 5e                           jne    400838 <_Z4Mainv+0x218>
  4007da:       c5 fd 6f 8c 24 60 02 00 00      vmovdqa 0x260(%rsp),%ymm1
  4007e3:       31 c0                           xor    %eax,%eax
  4007e5:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007ea:       0f 94 c0                        sete   %al
  4007ed:       75 49                           jne    400838 <_Z4Mainv+0x218>
  4007ef:       c5 fd 6f 8c 24 80 02 00 00      vmovdqa 0x280(%rsp),%ymm1
  4007f8:       31 c0                           xor    %eax,%eax
  4007fa:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  4007ff:       0f 94 c0                        sete   %al
  400802:       75 34                           jne    400838 <_Z4Mainv+0x218>
  400804:       c5 fd 6f 8c 24 a0 02 00 00      vmovdqa 0x2a0(%rsp),%ymm1
  40080d:       31 c0                           xor    %eax,%eax
  40080f:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  400814:       0f 94 c0                        sete   %al
  400817:       75 1f                           jne    400838 <_Z4Mainv+0x218>
  400819:       c5 fd 6f 8c 24 c0 02 00 00      vmovdqa 0x2c0(%rsp),%ymm1
  400822:       31 c0                           xor    %eax,%eax
  400824:       c4 e2 7d 17 c9                  vptest %ymm1,%ymm1
  400829:       0f 94 c0                        sete   %al
  40082c:       75 0a                           jne    400838 <_Z4Mainv+0x218>
  40082e:       31 c0                           xor    %eax,%eax
  400830:       c4 e2 7d 17 c0                  vptest %ymm0,%ymm0
  400835:       0f 94 c0                        sete   %al
  400838:       c5 f8 77                        vzeroupper 
  40083b:       c9                              leaveq 
  40083c:       c3                              retq   
  40083d:       b9 20 09 40 00                  mov    $0x400920,%ecx
  400842:       ba 26 00 00 00                  mov    $0x26,%edx
  400847:       be e9 08 40 00                  mov    $0x4008e9,%esi
  40084c:       bf f8 08 40 00                  mov    $0x4008f8,%edi
  400851:       c5 f8 77                        vzeroupper 
  400854:       e8 97 fc ff ff                  callq  4004f0 <__assert_fail@plt>
  400859:       0f 1f 80 00 00 00 00            nopl   0x0(%rax)