
Question

How do I align a pointer to a 16-byte boundary?

I found this code, but I am not sure whether it is correct:

char* p= malloc(1024);

if ((((unsigned long) p) % 16) != 0) 
{
     unsigned char *chpoint = (unsigned char *)p;        
     chpoint += 16 - (((unsigned long) p) % 16);
     p = (char *)chpoint;
}

Will this work?

Thanks.

Answer 1

C++0x proposes std::align, which does exactly that.

// get some memory
T* const p = ...;
std::size_t const size = ...;

void* start = p;
std::size_t space = size;
void* aligned = std::align(16, 1024, start, space);  // takes the mutable void*, not the const T*
if(aligned == nullptr) {
    // failed to align
} else {
    // here, start is aligned to 16 and points to at least 1024 bytes of memory
    // also start == aligned
    // size - space is the amount of bytes used for alignment
}

That seems pretty low-level, though. I think

// also available in Boost flavour
using storage = std::aligned_storage_t<1024, 16>;
auto p = new storage;

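also works. You can easily run afoul of the aliasing rules if you are not careful, though. If you have a precise scenario (fitting N objects of type T at a 16-byte boundary?), I think I could recommend something nicer.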
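To make the std::align call above concrete, here is a minimal sketch that runs it against an ordinary byte buffer. The helper names `first_aligned_16` and `is_aligned` are made up for this illustration and are not part of the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical helper: returns the first 16-byte-aligned address inside
// [buffer, buffer + buffer_size) that still leaves `bytes` bytes available,
// or nullptr if the request does not fit.
inline void* first_aligned_16(void* buffer, std::size_t buffer_size,
                              std::size_t bytes) {
  void* cursor = buffer;            // std::align advances this on success
  std::size_t space = buffer_size;  // and shrinks this by the padding it used
  return std::align(16, bytes, cursor, space);
}

// Hypothetical helper: true if `p` is a multiple of `alignment`.
inline bool is_aligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Over-allocating by 15 bytes (alignment minus one) is enough to guarantee that a 16-byte-aligned block of the requested size fits, whatever the starting address.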

Answer 2

Try this:

It returns aligned memory and frees it again, with practically no extra memory-management overhead.

#include <stdlib.h>  // malloc, free (<malloc.h> is non-standard)
#include <assert.h>

size_t roundUp(size_t a, size_t b) { return (1 + (a - 1) / b) * b; }

// we assume here that size_t and void* can be converted to each other
void *malloc_aligned(size_t size, size_t align = sizeof(void*))
{
    assert(align % sizeof(size_t) == 0);
    assert(sizeof(void*) == sizeof(size_t)); // not sure if needed, but whatever

    void *p = malloc(size + 2 * align);  // allocate with enough room to store the offset
    if (p != NULL)
    {
        size_t base = (size_t)p;
        p = (char*)roundUp(base, align) + align;  // align & make room for storing the offset
        ((size_t*)p)[-1] = (size_t)p - base;      // store the offset back to the malloc'ed pointer
    }
    return p;
}

void free_aligned(void *p) { free(p != NULL ? (char*)p - ((size_t*)p)[-1] : p); }

Warning:

I am pretty sure I am stepping on parts of the C standard here, but who cares. :P

Answer 3

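In glibc, malloc and realloc always return memory aligned to 8 bytes (16 bytes on 64-bit systems). If you want to allocate memory aligned to some larger power of two, you can use memalign or posix_memalign. Read http://www.gnu.org/s/hello/manual/libc/Aligned-Memory-Blocks.html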

Answer 4

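posix_memalign is one way: http://pubs.opengroup.org/onlinepubs/009695399/functions/posix_memalign.html as long as your alignment is a power of two (and a multiple of sizeof(void*)).

The problem with the solution you provide is that you risk writing past the end of your allocated memory. An alternative is to allocate the size you want + 16 and use a similar trick to get a pointer that is aligned but still inside your allocated region. That said, I would use posix_memalign as a first solution.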
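A minimal sketch of the posix_memalign route, assuming a POSIX system. The wrapper name `alloc_aligned` is hypothetical; only posix_memalign and free are real library calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX, not ISO C++)

// Hypothetical wrapper: returns `size` bytes aligned to `alignment`,
// or nullptr on failure. `alignment` must be a power of two and a
// multiple of sizeof(void*).
inline void* alloc_aligned(std::size_t alignment, std::size_t size) {
  void* p = nullptr;
  // posix_memalign returns 0 on success and an error number otherwise.
  if (posix_memalign(&p, alignment, size) != 0) return nullptr;
  return p;  // release with free()
}
```

Unlike the manual approach, the pointer that comes back can be passed straight to free, so there is no second pointer to keep track of.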

Answer 5

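A few things:

don't change the pointer returned by malloc / new: you will need it later to free the memory;
make sure the buffer is still big enough after adjusting the alignment;
use size_t instead of unsigned long: on common platforms size_t has the same size as a pointer, which unsigned long does not (64-bit Windows, for example). Strictly speaking, only uintptr_t is guaranteed to hold a pointer value.

Here is the code: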

size_t size = 1024; // this is how many bytes you need in the aligned buffer
size_t align = 16;  // this is the alignment boundary
char *p = (char*)malloc(size + align); // see second point above
char *aligned_p = (char*)((size_t)p + (align - (size_t)p % align));
// use the aligned_p here
// ...
// when you're done, call:
free(p); // see first point above
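The three points above can be packaged into one helper. The following is only a sketch: `AlignedBlock` and `allocate_aligned` are names invented here, and it uses uintptr_t for the address arithmetic. Unlike the snippet above, it adds no padding when malloc already returned an aligned address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical pair: `raw` is what malloc returned (pass it to free),
// `aligned` is the pointer to actually use.
struct AlignedBlock {
  void* raw;
  char* aligned;
};

// Allocates `size` usable bytes starting at a multiple of `align`.
// `align` must be nonzero.
inline AlignedBlock allocate_aligned(std::size_t size, std::size_t align) {
  void* raw = std::malloc(size + align - 1);  // worst-case padding is align - 1
  if (raw == nullptr) return {nullptr, nullptr};
  std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw);
  std::uintptr_t padding = (align - addr % align) % align;  // 0 if already aligned
  return {raw, static_cast<char*>(raw) + padding};
}
```

When you are done, call std::free on `raw`, never on `aligned`.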

Answer 6

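Updated: faster new algorithm

Do not use modulo: because of the division it can cost dozens of clock cycles on x86, and more on some other systems. I came up with a version of std::align that is faster than the ones in GCC and Visual C++. Visual C++ has the slowest implementation, which actually uses an amateurish conditional statement. GCC's is very similar to my algorithm, except that I did the opposite of what they did; mine is 13.3% faster because it takes 13 single-cycle instructions rather than 15. See the research paper with disassembly. The algorithm is actually one instruction faster still if you use a mask instead of pow_2.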

#include <cstdint>  // uintptr_t

/* Quickly aligns the given pointer to a power-of-two boundary.
@return An aligned pointer of typename T.
@desc Algorithm is a two's complement trick that works by masking off
the low bits of the negated address and adding them to the
pointer. Please note how I took the horizontal comment whitespace back.
@param pointer The pointer to align.
@param mask Mask for the low bits, which is one less than the power of
2 you wish to align to. */
template <typename T = char>
inline T* AlignUp(void* pointer, uintptr_t mask) {
  uintptr_t value = reinterpret_cast<uintptr_t>(pointer);
  value += (0 - value) & mask;  // unsigned negation wraps, giving ~value + 1
  return reinterpret_cast<T*>(value);
}

How do you call it?

enum { kSize = 256 };
char buffer[kSize + 64];  // slack for the largest alignment used below (64)
char* aligned_to_16_byte_boundary = AlignUp<> (buffer, 15); //< 16 - 1 = 15
char16_t* aligned_to_64_byte_boundary = AlignUp<char16_t> (buffer, 63);

Here is a quick bitwise proof for 3 bits; it works the same way for any number of bits:

~000 = 111 => 000 + 111 + 1 = 0b1000
~001 = 110 => 001 + 110 + 1 = 0b1000
~010 = 101 => 010 + 101 + 1 = 0b1000
~011 = 100 => 011 + 100 + 1 = 0b1000
~100 = 011 => 100 + 011 + 1 = 0b1000
~101 = 010 => 101 + 010 + 1 = 0b1000
~110 = 001 => 110 + 001 + 1 = 0b1000
~111 = 000 => 111 + 000 + 1 = 0b1000
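The same identity can be checked by brute force. The sketch below compares the mask trick with the plain modulo formula; all three function names are invented for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Padding needed to round `value` up to a multiple of (mask + 1),
// computed with the two's-complement trick: (-value) & mask.
inline std::uintptr_t padding_by_mask(std::uintptr_t value, std::uintptr_t mask) {
  return (0 - value) & mask;  // unsigned negation wraps, i.e. ~value + 1
}

// The same quantity computed the obvious way with modulo.
inline std::uintptr_t padding_by_modulo(std::uintptr_t value, std::uintptr_t pow_2) {
  return (pow_2 - value % pow_2) % pow_2;
}

// Returns true if both formulas agree for every value in [0, limit).
inline bool formulas_agree(std::uintptr_t pow_2, std::uintptr_t limit) {
  for (std::uintptr_t v = 0; v < limit; ++v) {
    if (padding_by_mask(v, pow_2 - 1) != padding_by_modulo(v, pow_2)) return false;
  }
  return true;
}
```

For example, with value 5 and mask 7 the padding is 3, which takes 5 up to 8.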

If you are here to learn how to align an object to a cache line in C++11, use the in-place constructor (placement new):

struct Foo { Foo () {} };
Foo* foo = new (AlignUp<Foo> (buffer, 63)) Foo ();
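A slightly fuller sketch of the placement-new pattern, assuming a 64-byte cache line. `align_up` and `make_foo_in` are hypothetical names, and this `Foo` carries a member so there is something to check. The point to note is that an object created this way is destroyed by calling its destructor explicitly, not with delete.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

struct Foo {
  int value;
  explicit Foo(int v) : value(v) {}
};

// Rounds `pointer` up to the next multiple of (mask + 1); mask = 2^n - 1.
inline void* align_up(void* pointer, std::uintptr_t mask) {
  std::uintptr_t value = reinterpret_cast<std::uintptr_t>(pointer);
  return reinterpret_cast<void*>(value + ((0 - value) & mask));
}

// Constructs a Foo on a 64-byte boundary inside `buffer`.
// The caller must provide at least sizeof(Foo) + 63 bytes and must later
// call the destructor explicitly (foo->~Foo()); no delete is involved.
inline Foo* make_foo_in(char* buffer, int v) {
  return new (align_up(buffer, 63)) Foo(v);
}
```

The buffer has to outlive the object, and it needs 63 spare bytes so the rounded-up address plus sizeof(Foo) still falls inside it.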

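Here is the std::align implementation. It uses 24 instructions where the GCC implementation uses 31. It can be tweaked to eliminate the decrement instruction by passing a mask of the least significant bits in place of (--align), but then it would no longer behave identically to std::align.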

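inline void* align(size_t align, size_t size, void*& ptr,
                   size_t& space) noexcept {
  uintptr_t int_ptr = reinterpret_cast<uintptr_t>(ptr);
  size_t offset = (0 - int_ptr) & (align - 1);  // bytes needed to reach the boundary
  // Check before subtracting: space is unsigned and would wrap around.
  if (space < size || space - size < offset) return nullptr;
  space -= offset;
  ptr = reinterpret_cast<void*>(int_ptr + offset);  // std::align also updates ptr
  return ptr;
}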
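One way to gain confidence in a hand-rolled replacement is to compare it with std::align over many starting addresses. The sketch below does that for a mask-based variant; `align_pow2` and `matches_std_align` are names invented here, and the check covers behaviour only, not instruction counts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical name: a mask-based equivalent of std::align for
// power-of-two alignments.
inline void* align_pow2(std::size_t alignment, std::size_t size, void*& ptr,
                        std::size_t& space) noexcept {
  std::uintptr_t address = reinterpret_cast<std::uintptr_t>(ptr);
  std::size_t offset = static_cast<std::size_t>((0 - address) & (alignment - 1));
  if (space < size || space - size < offset) return nullptr;  // does not fit
  space -= offset;
  ptr = reinterpret_cast<void*>(address + offset);
  return ptr;
}

// Returns true if align_pow2 and std::align agree on the result, on ptr
// and on space for every starting offset tried in a 256-byte buffer.
inline bool matches_std_align(std::size_t alignment, std::size_t size) {
  alignas(64) static unsigned char buffer[256];
  for (std::size_t start = 0; start < 128; ++start) {
    void* p1 = buffer + start;
    void* p2 = buffer + start;
    std::size_t s1 = sizeof buffer - start;
    std::size_t s2 = sizeof buffer - start;
    void* r1 = align_pow2(alignment, size, p1, s1);
    void* r2 = std::align(alignment, size, p2, s2);
    if (r1 != r2 || p1 != p2 || s1 != s2) return false;
  }
  return true;
}
```

Requests that cannot fit are part of the comparison too: both functions should return nullptr and leave ptr and space untouched.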

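Faster with a mask instead of pow_2

Here is the code for aligning with a mask rather than pow_2 (which is a power of two). It is 20% faster than the GCC algorithm, but it requires you to store the mask rather than pow_2, so it is not interchangeable.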

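inline void* AlignMask(size_t mask, size_t size, void*& ptr,
                       size_t& space) noexcept {
  uintptr_t int_ptr = reinterpret_cast<uintptr_t>(ptr);
  size_t offset = (0 - int_ptr) & mask;  // bytes needed to reach the boundary
  // Check before subtracting: space is unsigned and would wrap around.
  if (space < size || space - size < offset) return nullptr;
  space -= offset;
  ptr = reinterpret_cast<void*>(int_ptr + offset);  // update ptr, as std::align does
  return ptr;
}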
