Question

I want to test the performance gains from SSE/SSE2 when processing OpenCV's Mat. Since the speedup is only noticeable for 16-byte-aligned data, (1) do I need to modify the Mat matrix so that it can be used with the SSE registers? I did it as shown below; (2) is this a correct approach?

    void test(Mat flowxy, Mat flowresult)
    {
        __m128 x, y, xsquare, ysquare, ybyx, xRecip, sum, r, theta; // gen is for general purpose
        float *input = (float*)(flowxy.data);
        for (int i = 0; i < flowxy.rows; i++)
        {
            for (int j = 0; j + SSE_INCREMENT < flowxy.cols; j = j + SSE_INCREMENT)
            {
                x = _mm_set_ps(input[flowxy.step * (j+6) + i ], input[flowxy.step * (j+4) + i ], input[flowxy.step * (j+2) + i ], input[flowxy.step * (j) + i ]);
                y = _mm_set_ps(input[flowxy.step * (j+7) + i ], input[flowxy.step * (j+5) + i ], input[flowxy.step * (j+3) + i ], input[flowxy.step * (j+1) + i ]);
                xRecip = _mm_rcp_ps(x);
                xsquare = _mm_mul_ps(x, x);
                ysquare = _mm_mul_ps(y, y);
                ybyx = _mm_mul_ps(xRecip, y);
                sum = _mm_add_ps(xsquare, ysquare);
                r = _mm_sqrt_ps(sum);
                theta = taninverse(ybyx);
            }
        }
    }

Edit 1:

Following the discussion here, I reversed the order of the arguments when setting _mm_set_ps.

Answer 1

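The compiler will probably vectorize this code anyway, so you may not gain anything from explicit vectorization. Look at the generated code for the scalar branch and check whether it already contains SSE instructions. Also note that unaligned loads/stores are very expensive on older CPUs (if this is, e.g., a Core i7, you should be fine).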
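To make the aligned versus unaligned distinction concrete, here is a minimal sketch that checks 16-byte alignment at run time and picks the matching load/store intrinsic. The helper names `is_aligned16` and `magnitude_sse` are hypothetical and not from the question. The sketch assumes the x and y components have already been split into separate contiguous float arrays, which is not how the interleaved `flowxy` data in the question is laid out. With a cv::Mat you would typically pass row pointers obtained from `Mat::ptr<float>(row)`, and whether each row starts on a 16-byte boundary depends on the row stride, so the check below is done per call.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cmath>
#include <cstddef>
#include <cstdint>

// Returns true if p lies on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: out[i] = sqrt(x[i]^2 + y[i]^2) for n floats.
// Uses aligned loads/stores only when all three pointers are 16-byte aligned.
inline void magnitude_sse(const float* x, const float* y, float* out, std::size_t n) {
    const bool aligned = is_aligned16(x) && is_aligned16(y) && is_aligned16(out);
    std::size_t i = 0;
    // Process four floats per iteration; i stays a multiple of 4,
    // so x + i keeps the alignment of x.
    for (; i + 4 <= n; i += 4) {
        const __m128 vx = aligned ? _mm_load_ps(x + i) : _mm_loadu_ps(x + i);
        const __m128 vy = aligned ? _mm_load_ps(y + i) : _mm_loadu_ps(y + i);
        const __m128 sum = _mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy));
        const __m128 r = _mm_sqrt_ps(sum);
        if (aligned) {
            _mm_store_ps(out + i, r);
        } else {
            _mm_storeu_ps(out + i, r);
        }
    }
    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        out[i] = std::sqrt(x[i] * x[i] + y[i] * y[i]);
    }
}
```

Calling `_mm_load_ps` on an address that is not 16-byte aligned faults, which is why the unaligned variants are the safe default when the alignment of the Mat data is not known. It is worth timing this against the plain scalar loop, since the compiler may already emit equivalent SSE code.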

Mat matrix in OpenCV and 16-byte alignment for SSE
