Using % in SSE2?

Time: 2019-01-02 10:11:31

Tags: c++ intrinsics sse2

Here is the code I want to convert to SSE2:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
double *left = audioLeft;
double *right = audioRight;
double phase = 0.0;
double bp0 = mNoteFrequency * mHostPitch;

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    // some other code (that will use phase)

    phase += std::clamp(mRadiansPerSample * (bp0 * pB[sampleIndex] + pC[sampleIndex]), 0.0, PI);

    while (phase >= TWOPI) { phase -= TWOPI; }
}

This is what I have so far:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
double *left = audioLeft;
double *right = audioRight;
double phase = 0.0;
double bp0 = mNoteFrequency * mHostPitch;

__m128d v_boundLower = _mm_set1_pd(0.0);
__m128d v_boundUpper = _mm_set1_pd(PI);
__m128d v_bp0 = _mm_set1_pd(bp0);
__m128d v_radiansPerSample = _mm_set1_pd(mRadiansPerSample);

__m128d v_phase = _mm_set1_pd(phase);
__m128d v_pB = _mm_load_pd(pB);
__m128d v_pC = _mm_load_pd(pC);
__m128d v_result = _mm_mul_pd(v_bp0, v_pB);
v_result = _mm_add_pd(v_result, v_pC);
v_result = _mm_mul_pd(v_result, v_radiansPerSample);
v_result = _mm_max_pd(v_result, v_boundLower);
v_result = _mm_min_pd(v_result, v_boundUpper);

for (int sampleIndex = 0; sampleIndex < roundintup8(blockSize); sampleIndex += 8, pB += 8, pC += 8) {
    // some other code (that will use v_phase)

    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 2);
    v_pC = _mm_load_pd(pC + 2);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 4);
    v_pC = _mm_load_pd(pC + 4);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 6);
    v_pC = _mm_load_pd(pC + 6);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 8);
    v_pC = _mm_load_pd(pC + 8);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);

    // ... fmod?
}

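But I'm not sure how to replace while (phase >= TWOPI) { phase -= TWOPI; } (which is basically a classic fmod in C++).

Is there any fancy intrinsic for this? I can't find anything on this list. Division plus some kind of rocket shifting?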

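1 answer:

Answer 0: (score: 4)

As the comments said, it looks like you can make it a masked subtract using a compare + andpd. This works as long as you can never be more than one subtract away from getting back into the desired range.

Like this: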

const __m128d v2pi = _mm_set1_pd(TWOPI);


__m128d needs_range_reduction = _mm_cmpge_pd(vphase, v2pi);
__m128d offset = _mm_and_pd(needs_range_reduction, v2pi);  // 0.0 or 2*Pi
vphase = _mm_sub_pd(vphase, offset);
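
To make the snippet above easier to try in isolation, here is a minimal self-contained sketch of the same compare + and + subtract idea. The helper names wrap_once and wrap_once_pair are hypothetical (they are not from the original post), and the sketch assumes each lane is already below 2 * TWOPI. In the question's loop the increment is clamped to [0, PI], so if phase starts below TWOPI and the wrap runs after every single add, that assumption should hold.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: wraps each lane of `phase` back below `two_pi`,
// assuming every lane is already < 2 * two_pi (at most one subtract needed).
static inline __m128d wrap_once(__m128d phase, __m128d two_pi) {
    __m128d mask = _mm_cmpge_pd(phase, two_pi);  // all-ones where phase >= two_pi
    __m128d offset = _mm_and_pd(mask, two_pi);   // two_pi or 0.0 per lane
    return _mm_sub_pd(phase, offset);
}

// Hypothetical convenience wrapper operating on two doubles in memory.
static inline void wrap_once_pair(const double in[2], double out[2], double two_pi) {
    // Precondition of the single masked subtract.
    assert(in[0] < 2.0 * two_pi && in[1] < 2.0 * two_pi);
    __m128d v = _mm_loadu_pd(in);  // unaligned load, so any pointer works
    _mm_storeu_pd(out, wrap_once(v, _mm_set1_pd(two_pi)));
}
```

Lanes that are already in range have a mask of all zeros, so they subtract 0.0 and pass through unchanged; no branch is needed.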

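To implement an actual (slow) fmod without worrying too much about the last few bits of the significand, you would compute integer_quotient = floor(x/y) (or maybe rint(x/y) or ceil(x/y)), and then x - y * integer_quotient. floor / rint / ceil are cheap with SSE4.1 _mm_round_pd or _mm_floor_pd(). This gives you a remainder which, just like with integer division, can be negative.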
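
The floor-based recipe can be sketched as follows. This is an illustration under stated assumptions, not a drop-in fmod replacement: floor_mod_pd, floor_mod_pair and the SKETCH_TARGET_SSE41 macro are hypothetical names, the target attribute is a GCC/Clang feature used here only so the function builds without -msse4.1, and the CPU must actually support SSE4.1 at run time.

```cpp
#include <cassert>
#include <smmintrin.h>  // SSE4.1: _mm_floor_pd (also brings in the SSE2 intrinsics)

// Lets GCC/Clang compile these functions for SSE4.1 without -msse4.1.
// (Hypothetical macro name; MSVC needs no attribute.)
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_TARGET_SSE41
#endif

// Per lane: x - y * floor(x / y). Not bit-exact with std::fmod.
SKETCH_TARGET_SSE41
__m128d floor_mod_pd(__m128d x, __m128d y) {
    __m128d q = _mm_floor_pd(_mm_div_pd(x, y));  // integer_quotient
    return _mm_sub_pd(x, _mm_mul_pd(y, q));      // x - y * integer_quotient
}

// Hypothetical convenience wrapper: two doubles, one scalar modulus.
SKETCH_TARGET_SSE41
void floor_mod_pair(const double x[2], double y, double out[2]) {
    assert(y != 0.0);
    _mm_storeu_pd(out, floor_mod_pd(_mm_loadu_pd(x), _mm_set1_pd(y)));
}
```

With floor and a positive y, the result lands in roughly [0, y) even for negative x (for example, -1.0 mod 2.0 gives 1.0). Swapping in a different rounding function, such as rounding to nearest, changes which sign the remainder can take.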

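I'm sure there are numerical techniques that do a better job of avoiding rounding error before the catastrophic cancellation from subtracting two nearby numbers. If you care about precision, go look them up. (Using double vectors when you don't care much about precision is kind of silly; you might as well use float and get twice as much work done per vector.) If the input is much larger than the modulus, some loss of precision is unavoidable, and minimizing rounding error in the temporaries is probably very important. Otherwise, precision will only be a problem if you care about the relative error in results very close to zero, when x is almost an exact multiple of y. (For a near-zero result, the only bits of precision left are the low bits of the significand.)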

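Without SSE4.1, there are tricks such as adding and then subtracting a large enough number. Converting to integer and back is even worse for pd, because the packed-conversion instructions also decode to extra shuffle uops. Not to mention that a 32-bit integer can't cover the full range of double, but if your input is that huge, you're already out of luck for range-reduction precision.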
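
One common form of the add-then-subtract trick is sketched below, using SSE2 only. The specific constant (1.5 * 2^52), the helper names, and the fix-up from round-to-nearest to floor are my assumptions, not something the answer spells out. The sketch relies on the default round-to-nearest mode, on |x| < 2^51, and on the compiler not reassociating floating-point math (so no -ffast-math).

```cpp
#include <cassert>
#include <cmath>
#include <emmintrin.h>  // SSE2 intrinsics

// Round each lane to the nearest integer (ties to even) for |x| < 2^51.
// Adding 1.5 * 2^52 pushes x into a range where doubles have a spacing of 1,
// so the addition itself performs the rounding; the subtraction is then exact.
static inline __m128d round_nearest_sse2(__m128d x) {
    const __m128d magic = _mm_set1_pd(6755399441055744.0);  // 1.5 * 2^52
    return _mm_sub_pd(_mm_add_pd(x, magic), magic);
}

// floor(x): round to nearest, then subtract 1.0 where that rounded up past x.
static inline __m128d floor_sse2(__m128d x) {
    __m128d r = round_nearest_sse2(x);
    __m128d too_big = _mm_cmpgt_pd(r, x);  // all-ones where r > x
    return _mm_sub_pd(r, _mm_and_pd(too_big, _mm_set1_pd(1.0)));
}

// Hypothetical convenience wrapper operating on two doubles in memory.
static inline void floor_pair_sse2(const double in[2], double out[2]) {
    // The trick is only valid for |x| < 2^51.
    assert(std::fabs(in[0]) < 2251799813685248.0);
    assert(std::fabs(in[1]) < 2251799813685248.0);
    _mm_storeu_pd(out, floor_sse2(_mm_loadu_pd(in)));
}
```

The result of floor_sse2 can stand in for the SSE4.1 floor when computing x - y * integer_quotient on hardware that only has SSE2.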

If you have FMA, you can avoid the rounding error in the y * integer_quotient part of the multiply and subtract.
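
To show what the fused version buys you, here is a small scalar sketch using std::fma from <cmath>, which rounds only once after computing x - y * q exactly. The function name remainder_fused is hypothetical. On FMA-capable hardware the packed equivalent would be _mm_fnmadd_pd(y, q, x), which computes -(y * q) + x; the scalar form is used here only so the sketch runs on any machine.

```cpp
#include <cassert>
#include <cmath>

// x - y * q with a single rounding at the end (q is the integer quotient).
// Packed-double equivalent with FMA3: _mm_fnmadd_pd(y, q, x).
inline double remainder_fused(double x, double y, double q) {
    assert(q == std::floor(q));  // q is expected to be integer-valued
    return std::fma(-y, q, x);
}
```

As an example of the difference: with x = 0.3, y = 0.1, q = 3.0, the product 0.1 * 3.0 rounds up to 0.30000000000000004 when computed on its own, so a separate multiply and subtract yields -2^-54, while the fused form returns the exactly representable -2^-55.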