Question

In my program, I need to apply __attribute__(( aligned(32))) to an int * or float *. I tried it like this, but I am not sure it will work:

    int  *rarray __attribute__(( aligned(32)));

I saw this, but did not find an answer there.

Answer 1

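So you want to tell the compiler that your pointers are aligned? For example, that all callers of this function will pass pointers that are guaranteed to be aligned: either pointers to aligned static or local storage, or pointers obtained from C11 aligned_alloc or POSIX posix_memalign. (If those are not available, _mm_malloc is one option, but free is not guaranteed to be safe on _mm_malloc results: you need _mm_free.) This allows the compiler to auto-vectorize without emitting a bunch of bloated code to handle unaligned inputs.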
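As a minimal sketch of the allocation side, assuming a C11 library that provides aligned_alloc: the helper name `alloc_ints_32` is hypothetical, and the size is rounded up because C11 as originally published requires the size to be a multiple of the alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not from the original answer): allocate room for
   n ints on a 32-byte boundary. Does not guard against overflow of
   n * sizeof(int); a real version should. */
static int *alloc_ints_32(size_t n)
{
    /* Round the byte count up to a multiple of 32, and never ask for 0. */
    size_t bytes = (n * sizeof(int) + 31) & ~(size_t)31;
    if (bytes == 0)
        bytes = 32;

    int *p = aligned_alloc(32, bytes);

    /* Debug check only: a non-NULL result should be 32-byte aligned. */
    assert(p == NULL || ((uintptr_t)p % 32) == 0);
    return p;   /* release with plain free() */
}
```

Memory from aligned_alloc or posix_memalign is released with ordinary free, which is the difference from _mm_malloc noted above.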

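When you vectorize manually with intrinsics, you use _mm256_loadu_si256 or _mm256_load_si256 to tell the compiler whether the memory is aligned or not. Communicating alignment information is the main point of the load/store intrinsics, as opposed to simply dereferencing __m256i pointers.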
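A small sketch of that choice, written with the 128-bit SSE2 analogues (`_mm_load_si128` / `_mm_loadu_si128`) so that it builds on any x86-64 target without `-mavx2`; the 256-bit intrinsics named above follow the same pattern with 32-byte alignment. The function names `sum_aligned` and `sum_unaligned` are made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics, baseline on x86-64 */
#include <stdint.h>

/* Add up the four 32-bit lanes of a vector. */
static int hsum_epi32(__m128i v)
{
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Caller promises p is 16-byte aligned; the aligned-load intrinsic is
   how that promise reaches the compiler. Sums 4 * n_vecs ints. */
static int sum_aligned(const int *p, int n_vecs)
{
    assert(((uintptr_t)p % 16) == 0);   /* debug check of the promise */
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n_vecs; i++)
        acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)(p + 4 * i)));
    return hsum_epi32(acc);
}

/* No alignment promise: use the unaligned-load intrinsic instead. */
static int sum_unaligned(const int *p, int n_vecs)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n_vecs; i++)
        acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(p + 4 * i)));
    return hsum_epi32(acc);
}
```

Calling the aligned variant with a misaligned pointer can fault at run time, so the unaligned variant is the safe default when nothing is known about the pointer.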

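I don't think there is a portable way to inform the compiler that a pointer points to aligned memory. (C11 / C++11 alignas does not seem able to do that, see below.)

With GNU C __attribute__ syntax, it seems to be necessary to use a typedef so that the attribute applies to the pointed-to type rather than to the pointer itself. It is easier to type and easier to read if you declare an aligned_int type or something similar.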

    // Only helps GCC, not clang or ICC
    typedef __attribute__(( aligned(32))) int aligned_int;

    int my_func(const aligned_int *restrict a, const aligned_int *restrict b) {
        int sum = 0;
        for (int i=0 ; i<1024 ; i++) {
            sum += a[i] - b[i];
        }
        return sum;
    }

This auto-vectorizes without any bloat for handling unaligned inputs (gcc 5.3 with -O3 on godbolt):

        pxor    xmm0, xmm0
        xor     eax, eax
    .L2:
        psubd   xmm0, XMMWORD PTR [rsi+rax]
        paddd   xmm0, XMMWORD PTR [rdi+rax]
        add     rax, 16
        cmp     rax, 4096
        jne     .L2            # end of vector loop

        ...                    # horizontal sum with psrldq omitted, see the godbolt link if you're curious
        movd    eax, xmm0
        ret

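Without the aligned attribute, you get a big block of scalar intro/outro code, and it gets even worse with -march=haswell, where the AVX2 code has a wider inner loop.

Clang's normal strategy for unaligned inputs is to use unaligned loads/stores instead of fully unrolled intro/outro loops. Without AVX, this means the loads cannot be folded into memory operands of SSE ALU instructions.

The aligned attribute doesn't help clang (tested as recently as clang7.0): it still uses separate movdqu loads. Note that clang's loop is bigger because it unrolls by 4 by default, whereas gcc does not unroll at all without -funroll-loops (which is enabled by -fprofile-use).

But note that this aligned_int typedef only works for GCC itself, not for clang or ICC. The question "gcc memory alignment pragma" has another example.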

__builtin_assume_aligned is noisier syntax, but it works on all compilers that support GNU C extensions.

See "How to tell GCC that a pointer argument is always double-word-aligned?"
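A sketch of the same loop written with `__builtin_assume_aligned` instead of the typedef; the function name `my_func_assume` and the debug asserts are additions for illustration, not part of the answer above. The builtin returns a `void *` that the compiler may assume is aligned, so the result has to be the pointer you actually use.

```c
#include <assert.h>
#include <stdint.h>

/* Same reduction as my_func, but the alignment promise is made with
   __builtin_assume_aligned (GCC, clang, ICC) rather than a typedef. */
int my_func_assume(const int *restrict a_in, const int *restrict b_in)
{
    /* Debug checks only: lying to the builtin is undefined behaviour. */
    assert(((uintptr_t)a_in % 32) == 0);
    assert(((uintptr_t)b_in % 32) == 0);

    const int *a = __builtin_assume_aligned(a_in, 32);
    const int *b = __builtin_assume_aligned(b_in, 32);

    int sum = 0;
    for (int i = 0; i < 1024; i++) {
        sum += a[i] - b[i];
    }
    return sum;
}
```

Whether a given compiler version then drops the unaligned-handling code is something to confirm in the generated asm rather than assume.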

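Note that you can't make an array of aligned_int. (See the comments for discussion of sizeof(aligned_int), and the fact that it is still 4, not 32.) GNU C refuses to treat it as an int with padding, so with gcc 5.3: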

    static aligned_int arr[1024]; // error: alignment of array elements is greater than element size
    int tmp = sizeof(arr);

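clang-3.8 compiles that and initializes tmp to 4096, presumably because it completely ignores the aligned attribute in that context and does not do whatever magic gcc does to have a type that is narrower than its required alignment. (So only every fourth element would actually have that alignment.)

The gcc docs claim that using the aligned attribute on a struct does let you make an array, and that this is one of the main use cases. However, as @user3528438 pointed out in the comments, this is not the case: you get the same error as when trying to declare an array of aligned_int. This has been the case since 2005.

To define aligned local or static/global arrays, the aligned attribute should be applied to the entire array, not to each element.

In portable C11 and C++11, you can use things like alignas(32) int myarray[1024];. See also "Struggling with alignas syntax": alignas seems to be useful only for aligning objects themselves, not for declaring that a pointer points to aligned memory. std::align is more like ((uintptr_t)ptr) & ~63 or something: it forcibly aligns a pointer rather than telling the compiler that the pointer was already aligned.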

    // declaring aligned storage for arrays
    #ifndef __cplusplus
    #include <stdalign.h>   // for C11: defines alignas() using _Alignas()
    #endif                  // C++11 defines alignas without any headers

    // works for global/static or local (aka automatic storage)
    alignas(32) int foo[1000];     // portable ISO C++11 and ISO C11 syntax

    // __attribute__((aligned(32))) int foo[1000];  // older GNU C
    // __declspec something  // older MSVC

See the C11 alignas() documentation on cppreference.
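To make the "forcibly align a pointer" idea concrete, here is a rough sketch of rounding a pointer up to the next boundary inside a larger buffer. The helper name `align_up` is hypothetical, and it assumes the alignment is a power of two and that the caller has left enough slack in the buffer. Note that it rounds up, whereas the bare `& ~63` expression rounds down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the first address >= p that is a multiple
   of align. align must be a power of two. This changes the pointer; it
   tells the compiler nothing about alignment. */
static void *align_up(void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    uintptr_t addr    = (uintptr_t)p;
    uintptr_t aligned = (addr + (align - 1)) & ~(uintptr_t)(align - 1);

    /* Advance the original pointer by the padding, staying in its object. */
    return (char *)p + (aligned - addr);
}
```

A typical use is over-allocating by `align - 1` bytes and then calling `align_up` on the start of the buffer.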

If you want portability to older compilers that don't support C11, CPP macros can be used to choose between the GNU C __attribute__ syntax and the MSVC __declspec syntax.
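One way such a macro might look, as a sketch: the macro name `ALIGNED_VAR`, the array `lookup_table`, and the helper `table_is_aligned` are made up for illustration, and only the GNU C branch has been exercised here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical macro: pick the vendor-specific spelling for
   "align this variable to n bytes". */
#if defined(_MSC_VER)
#  define ALIGNED_VAR(n) __declspec(align(n))
#elif defined(__GNUC__)            /* GCC, clang, ICC */
#  define ALIGNED_VAR(n) __attribute__((aligned(n)))
#else
#  error "no known alignment syntax for this compiler"
#endif

/* The attribute applies to the whole array, not to each element. */
static ALIGNED_VAR(32) int lookup_table[1000];

/* Returns 1 if the array really starts on a 32-byte boundary. */
static int table_is_aligned(void)
{
    return ((uintptr_t)lookup_table % 32) == 0;
}
```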

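Declaring a local array with more alignment than the stack pointer is guaranteed to have forces the compiler to reserve extra space and then AND the stack pointer to get an aligned pointer, e.g.: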

    void foo(int *p);
    void bar(void) {
        __attribute__((aligned(32))) int a[1000];
        foo (a);
    }

compiles to (clang-3.8 -O3 -std=gnu11 for x86-64)

        push    rbp
        mov     rbp, rsp       # stack frame with base pointer since we're doing unpredictable things to rsp
        and     rsp, -32       # 32B-align the stack
        sub     rsp, 4032      # reserve up to 32B more space than needed
        lea     rdi, [rsp]     # this is weird: mov rdi,rsp is a shorter insn to set up foo's arg
        call    foo
        mov     rsp, rbp
        pop     rbp
        ret

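gcc (later than 4.8.2) makes significantly larger code that does a lot of extra work for no reason, the strangest part being push QWORD PTR [r10-8], which copies some stack memory to another place on the stack. (Check it out on the godbolt link: flip the compiler to gcc.)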

How to apply __attribute__((aligned(32))) to an int *?
