Question

I want to optimize a for loop with SSE/SSE2 instructions to speed up image compression.

size_t height = get_height();
size_t width = get_width();
size_t total_size = height * width * 3;
uint8_t *src = get_pixels();
uint8_t *dst = new uint8_t[total_size / 6];
uint8_t *tmp = dst;
rgb_t block[16];

if (height % 4 != 0 || width % 4 != 0) {
    cerr << "Texture compression only supported for images if width and height are multiples of 4" << endl;
    return;
}

// Split image in 4x4 pixels zones
for (unsigned y = 0; y < height; y += 4, src += width * 3 * 4) {
    for (unsigned x = 0; x < width; x += 4, dst += 8) {
        const rgb_t *row0 = reinterpret_cast<const rgb_t*>(src + x * 3);
        const rgb_t *row1 = row0 + width;
        const rgb_t *row2 = row1 + width;
        const rgb_t *row3 = row2 + width;

        // Extract 4x4 matrix of pixels from a linearized matrix(linear memory).
        memcpy(block, row0, 12);
        memcpy(block + 4, row1, 12);
        memcpy(block + 8, row2, 12);
        memcpy(block + 12, row3, 12);

        // Compress block and write result in dst.
        compress_block(block, dst);
    }
}

How can I read a whole row from the matrix with SSE/SSE2 registers, given that a row holds 4 elements of 3 bytes each? The rgb_t struct has three uint8_t members.

Answer 1

Why do you think the compiler doesn't already generate good code for those 12-byte copies?

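If it doesn't, copying 16 bytes for each of the first three rows (so that consecutive copies overlap in the destination) will probably let it use SSE vectors. Padding your array lets you do the last copy as a 16-byte memcpy as well, which should also compile to a single 16-byte vector load/store: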

alignas(16) rgb_t buf[16 + 4];

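Alignment probably doesn't matter much, since only the first store will be aligned anyway. But it may also help the function you pass the buffer to.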
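The overlapping-copy idea can be sketched with SSE2 intrinsics roughly as follows. This is only an illustration under stated assumptions: `load_block_4x4` is a hypothetical helper, not part of the question's code, and `rgb_t` is assumed to be a tightly packed 3-byte struct. Note that each 16-byte load reads 4 bytes beyond the 12 that are needed, so for the last block of the last row the source buffer must have at least 4 readable bytes after the pixel data (or that block should fall back to 12-byte copies).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Assumed layout: three uint8_t channels, no padding.
struct rgb_t { uint8_t r, g, b; };
static_assert(sizeof(rgb_t) == 3, "rgb_t must be tightly packed");

// Hypothetical helper: gathers one 4x4 pixel block into `block`.
// `row0` points at the first pixel of the block's top row and
// `stride_bytes` is the size of one image row in bytes (width * 3).
// `block` must have room for 16 + 4 rgb_t, because every store writes
// 16 bytes at a 12-byte step and the last store spills past pixel 15.
inline void load_block_4x4(const uint8_t *row0, std::size_t stride_bytes,
                           rgb_t *block) {
    assert(row0 != nullptr && block != nullptr);
    uint8_t *out = reinterpret_cast<uint8_t *>(block);
    for (std::size_t r = 0; r < 4; ++r) {
        // Unaligned 16-byte load: 12 wanted bytes plus 4 from the next pixel.
        __m128i v = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>(row0 + r * stride_bytes));
        // Unaligned 16-byte store: the 4 extra bytes are overwritten by the
        // next row's store, or land in the padding after the last row.
        _mm_storeu_si128(reinterpret_cast<__m128i *>(out + r * 12), v);
    }
}
```

In the loop from the question, the four memcpy calls would then become something like `load_block_4x4(src + x * 3, width * 3, buf);` with `buf` declared as above, followed by `compress_block(buf, dst);`. It is still worth comparing the generated assembly against the plain memcpy version before keeping this.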
