Question

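I have a buffer of 12-bit data (stored in 16-bit words) and need to convert it to 8-bit (shift right by 4).

How can NEON speed up this processing?

Thanks for your help

Brahim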

Answer 1

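I took the liberty of assuming a few things, explained below, but this kind of code (untested, may need some tweaks) should give a good speedup over a naive non-NEON version: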

#include <arm_neon.h>
#include <stdint.h>

void convert(const uint16_t *restrict input, // the buffer to convert
             uint8_t *restrict output,       // the buffer in which to store result
             int sz) {                       // their (common) size, in elements

  /* Assuming the buffer size is a multiple of 8 */
  for (int i = 0; i < sz; i += 8) {
    // Load a vector of 8 16-bit values:
    uint16x8_t v = vld1q_u16(input+i);
    // Shift it by 4 to the right, narrowing it to 8 bit values.
    uint8x8_t shifted = vshrn_n_u16(v, 4);
    // Store it in output buffer
    vst1_u8(output+i, shifted);
  }

}

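Things I have assumed here:

You are working with unsigned values. If that is not the case, it is easy to adapt anyway (uint* -> int*, *_u8 -> *_s8 and *_u16 -> *_s16).
Since the values are loaded 8 at a time, I assumed the buffer length is a multiple of 8 to avoid edge cases. If that is not the case, you should probably pad it artificially to a multiple of 8.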
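
If padding the buffer is inconvenient, another option is to let a scalar loop finish whatever is left after the last full group of 8. The following is only a sketch of that idea: the name convert_any_length is made up for illustration, and the `__ARM_NEON` guard is there so the same file still builds (scalar only) on a non-NEON target.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the answer above): converts n 12-bit
   samples stored in uint16_t to 8 bits, for any n. */
void convert_any_length(const uint16_t *input, uint8_t *output, size_t n) {
  size_t i = 0;
#if defined(__ARM_NEON)
  /* Vector part: full groups of 8 elements only. */
  for (; i + 8 <= n; i += 8) {
    uint16x8_t v = vld1q_u16(input + i);
    vst1_u8(output + i, vshrn_n_u16(v, 4));
  }
#endif
  /* Scalar tail: the remaining 0..7 elements (or everything without NEON). */
  for (; i < n; ++i) {
    output[i] = (uint8_t)(input[i] >> 4);
  }
}
```

The tail costs at most 7 scalar iterations, so for large buffers it should not change the overall speedup noticeably.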

Finally, here are the 2 resource pages from the NEON documentation that I used:

About loading and storing vectors.
About shifting vectors.

Hope this helps!

Answer 2

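prototype : void dataConvert(void * pDst, void * pSrc, unsigned int count);
    @ r0 = pDst, r1 = pSrc, r2 = count (in elements, assumed to be a multiple of 32)
    1:
    vld1.16 {q8-q9}, [r1]!      @ load 16 source values, advance pSrc
    vld1.16 {q10-q11}, [r1]!    @ load 16 more
    vqrshrn.u16 d16, q8, #4     @ rounding, saturating shift right by 4, narrow to 8 bit
    vqrshrn.u16 d17, q9, #4
    vqrshrn.u16 d18, q10, #4
    vqrshrn.u16 d19, q11, #4
    vst1.16 {q8-q9}, [r0]!      @ store the 32 result bytes (d16-d19), advance pDst
    subs r2, #32                @ 32 elements done per iteration
    bgt 1b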

q flag: saturation

r flag: rounding

If the data is signed, change u16 to s16.
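
To make the effect of the two flags concrete, here is a small scalar model in C of what happens to each element. This is only an illustrative sketch, not code from either answer: the function names are made up, and it models the unsigned case with a shift of 4 as used above.

```c
#include <stdint.h>

/* Plain narrowing shift (no q, no r), as in vshrn: drop the low 4 bits
   and keep the low 8 bits of what remains. */
uint8_t shift_truncate(uint16_t x) {
  return (uint8_t)(x >> 4);
}

/* Model of vqrshrn.u16 with #4:
   r = rounding: add half of the dropped range (8) before shifting;
   q = saturation: clamp to 255 instead of wrapping around. */
uint8_t shift_round_saturate(uint16_t x) {
  uint32_t r = ((uint32_t)x + 8u) >> 4;
  return (uint8_t)(r > 255u ? 255u : r);
}
```

For example, 0x0128 gives 0x12 when truncated and 0x13 when rounded. The saturation matters for the largest 12-bit inputs: 0x0FF8 rounds up to 0x100, which does not fit in 8 bits and is clamped to 0xFF.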

NEON acceleration for 12-bit to 8-bit conversion
