I have a simple question. I have a starting uint32_t value (say 125) and an __m128i of operands to add, for example (+5, +10, -1, -5). I want to obtain, as fast as possible, the vector (125+5, 125+5+10, 125+5+10-1, 125+5+10-1-5), i.e. accumulate the operands onto the starting value. So far the only solution I have come up with is adding 4 __m128i variables. For example, they would be
/* pseudo-SSE code... */
__m128i src      = (125, 125, 125, 125)
__m128i operands = (  5,  10,  -1,  -5)
/* Here I omit the partitioning of operands into add1..add4 for brevity */
__m128i add1 = (+05, +05, +05, +05)
__m128i add2 = (+00, +10, +10, +10)
__m128i add3 = (+00, +00, -01, -01)
__m128i add4 = (+00, +00, +00, -05)
__m128i res1 = _mm_add_epi32( add1, add2 )
__m128i res2 = _mm_add_epi32( add3, add4 )
__m128i res3 = _mm_add_epi32( res1, res2 )
__m128i res  = _mm_add_epi32( res3, src )
This way I get what I want. For this solution I need to set up all the add variables and then perform 4 additions. What I am really asking is whether this can be done faster, either with some different algorithm or with some specialized SSE function I do not know about yet (something like _mm_cumulative_sum()). Many thanks.
Answer 0 (score: 5)
You can add more parallelism and use 3 additions instead of 4:
const __m128i src = _mm_set1_epi32(125);
const __m128i operands = _mm_set_epi32(5, 10, -1, -5);
const __m128i shift1 =
    _mm_add_epi32(operands,
        _mm_and_si128(_mm_shuffle_epi32(operands, 0xF9),
                      _mm_set_epi32(0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF)));
const __m128i shift2 =
    _mm_add_epi32(shift1,
        _mm_and_si128(_mm_shuffle_epi32(shift1, 0xFE),
                      _mm_set_epi32(0, 0, 0xFFFFFFFF, 0xFFFFFFFF)));
const __m128i res = _mm_add_epi32(src, shift2);
This uses only the SSE2 instruction set. With a newer instruction set you can replace each _mm_and_si128 / _mm_shuffle_epi32 pair with a single instruction such as _mm_shuffle_epi8.
The cumulative sum is computed with 2 additions, like this:
    a      b      c        d
  +        a      b        c
  -----------------------------
    a     a+b    b+c      c+d
  +               a       a+b
  -----------------------------
    a     a+b   a+b+c   a+b+c+d
SSE is not well suited to tasks like this. Its performance shines only for "vertical" operations; the "horizontal" operations needed here require a lot of extra work.
Answer 1 (score: 1)
Thanks everybody for the help. To find out which version is fastest, I wrote a test application.
1/ The non-SSE version does all the work just as you would expect.
int iRep;
int iCycle;
int iVal = 25;
int a1, a2, a3, a4;
int dst1[4];

for ( iCycle = 0; iCycle < CYCLE_COUNT; iCycle++ )
    for ( iRep = 0; iRep < REP_COUNT; iRep++ )
    {
        a1 = a2 = a3 = a4 = iRep;
        dst1[0] = iVal    + a1;
        dst1[1] = dst1[0] + a2;
        dst1[2] = dst1[1] + a3;
        dst1[3] = dst1[2] + a4;
    }
2/ The SSE 4-additions version follows my suggestion, i.e.
__m128i _a1, _a2, _a3, _a4;
__m128i _res1, _res2, _res3;
__m128i _val;
__m128i _res;

for ( iCycle = 0; iCycle < CYCLE_COUNT; iCycle++ )
    for ( iRep = 0; iRep < REP_COUNT; iRep++ )
    {
        a1 = a2 = a3 = a4 = iRep;
        _val  = _mm_set1_epi32( iVal );
        _a1   = _mm_set_epi32( a1, a1, a1, a1 );
        _a2   = _mm_set_epi32( a2, a2, a2, 0  );
        _a3   = _mm_set_epi32( a3, a3, 0 , 0  );
        _a4   = _mm_set_epi32( a4, 0 , 0 , 0  );
        _res1 = _mm_add_epi32( _a1, _a2 );
        _res2 = _mm_add_epi32( _a3, _a4 );
        _res3 = _mm_add_epi32( _val, _res1 );
        _res  = _mm_add_epi32( _res3, _res2 );
    }
3/ The SSE 3-additions version does what Evgeny proposed, i.e.
__m128i shift1, shift2, operands;

for ( iCycle = 0; iCycle < CYCLE_COUNT; iCycle++ )
    for ( iRep = 0; iRep < REP_COUNT; iRep++ )
    {
        a1 = a2 = a3 = a4 = iRep;
        _val = _mm_set1_epi32( iVal );
        operands = _mm_set_epi32( a1, a2, a3, a4 );
        shift1 = _mm_add_epi32( operands,
            _mm_and_si128( _mm_shuffle_epi32( operands, 0xF9 ),
                           _mm_set_epi32( 0, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF ) ) );
        shift2 = _mm_add_epi32( shift1,
            _mm_and_si128( _mm_shuffle_epi32( shift1, 0xFE ),
                           _mm_set_epi32( 0, 0, 0xFFFFFFFF, 0xFFFFFFFF ) ) );
        _res = _mm_add_epi32( _val, shift2 );
    }
The results, with

#define REP_COUNT   100000
#define CYCLE_COUNT 100000

were:
non-SSE -> 6.118s
SSE-4additions -> 20.775s
SSE-3additions -> 14.873s
Quite surprising...