Question

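I am developing better Hermite interpolation code with SSE/SSE2 (or even SSE3/4/AVX...) that works on a stream of 16-bit integers.

So far it works well, but I would like to know whether it can be optimized further. I would also like to know whether the 16-bit integer data can be loaded faster.

Thanks for any suggestions.

Here is the original Hermite interpolation code.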

Hermite Interpolation
//
public static float InterpolateHermite4pt3oX(float x0, float x1, float x2, float x3, float t)
{
    float c0 = x1;
    float c1 = .5F * (x2 - x0);
    float c2 = x0 - (2.5F * x1) + (2 * x2) - (.5F * x3);
    float c3 = (.5F * (x3 - x0)) + (1.5F * (x1 - x2));
    return (((((c3 * t) + c2) * t) + c1) * t) + c0;
}

Here is my SSE code so far.

static __m128 S0, S1, S2, S3;
static __m128 dot5 = _mm_set1_ps(0.5f);
static __m128 TwoDot5 = _mm_set1_ps(2.5f);
static __m128 OneDot5 = _mm_set1_ps(1.5f);
static __m128 One = _mm_set1_ps(1.0f);
static __m128 Two = _mm_set1_ps(2.0f);
static __m128 mul16b = _mm_set1_ps(BITS_16_MULT);

#define HIC0 S1
#define HIC1 _mm_mul_ps(dot5, _mm_sub_ps(S2, S0))
#define HIC2 _mm_sub_ps(_mm_add_ps(_mm_sub_ps(S0,  _mm_mul_ps(TwoDot5, S1)), _mm_mul_ps(Two, S2)), _mm_mul_ps(dot5, S3))
#define HIC3 _mm_add_ps(_mm_mul_ps(dot5, _mm_sub_ps(S3, S0)), _mm_mul_ps(OneDot5, _mm_sub_ps(S1, S2)))

#define HICRETURN _mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(_mm_add_ps(_mm_mul_ps(HIC3, fract), HIC2), fract), HIC1), fract), HIC0)

__m128 fract = _mm_set1_ps(fractPos);
_mm_store_ps(tempWave, HICRETURN);

S0 to S3 are the samples, Sample0 to Sample3. fractPos is the fractional position between one sample and the next.

To read the samples, I use:
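For checking the macros against the scalar formula, the same evaluation can be written as a small function. This is only a sketch: `hermite4_scalar` and `hermite4_sse` are hypothetical names that do not appear in my code, and the SSE version simply works lane by lane on whatever the four sample registers hold.

```cpp
#include <xmmintrin.h>  // SSE

// Scalar reference with the same coefficients as InterpolateHermite4pt3oX.
inline float hermite4_scalar(float x0, float x1, float x2, float x3, float t)
{
    float c0 = x1;
    float c1 = 0.5f * (x2 - x0);
    float c2 = x0 - 2.5f * x1 + 2.0f * x2 - 0.5f * x3;
    float c3 = 0.5f * (x3 - x0) + 1.5f * (x1 - x2);
    return ((c3 * t + c2) * t + c1) * t + c0;
}

// Lane-wise SSE version (hypothetical helper): each lane evaluates the
// same polynomial on the values found in that lane of s0..s3 and t.
inline __m128 hermite4_sse(__m128 s0, __m128 s1, __m128 s2, __m128 s3, __m128 t)
{
    const __m128 half    = _mm_set1_ps(0.5f);
    const __m128 oneHalf = _mm_set1_ps(1.5f);
    const __m128 two     = _mm_set1_ps(2.0f);
    const __m128 twoHalf = _mm_set1_ps(2.5f);

    // c1 = 0.5 * (s2 - s0)
    __m128 c1 = _mm_mul_ps(half, _mm_sub_ps(s2, s0));
    // c2 = s0 - 2.5 * s1 + 2 * s2 - 0.5 * s3
    __m128 c2 = _mm_sub_ps(
        _mm_add_ps(_mm_sub_ps(s0, _mm_mul_ps(twoHalf, s1)), _mm_mul_ps(two, s2)),
        _mm_mul_ps(half, s3));
    // c3 = 0.5 * (s3 - s0) + 1.5 * (s1 - s2)
    __m128 c3 = _mm_add_ps(_mm_mul_ps(half, _mm_sub_ps(s3, s0)),
                           _mm_mul_ps(oneHalf, _mm_sub_ps(s1, s2)));

    // Horner evaluation: ((c3 * t + c2) * t + c1) * t + c0, with c0 = s1.
    __m128 r = _mm_add_ps(_mm_mul_ps(c3, t), c2);
    r = _mm_add_ps(_mm_mul_ps(r, t), c1);
    return _mm_add_ps(_mm_mul_ps(r, t), s1);
}
```

Computing each coefficient once in a local variable also avoids relying on the compiler to clean up the nested macro expansion.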

int16* xData = (int16*)sampleData16Bits.getData();
tempWave[0] = float(xData[newPosition]);
tempWave[1] = float(xData[newPosition + 1]);
tempWave[2] = float(xData[newPosition + 2]);
tempWave[3] = float(xData[newPosition + 3]);
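// Scale the four converted samples; tempWave must be 16-byte aligned for _mm_load_ps.
S0 = _mm_mul_ps(_mm_load_ps(tempWave), mul16b);
// 0x39 rotates the lanes down by one, so lane 0 of S1, S2, S3 holds sample 1, 2, 3.
S1 = _mm_shuffle_ps(S0, S0, 0x39);
S2 = _mm_shuffle_ps(S1, S1, 0x39);
S3 = _mm_shuffle_ps(S2, S2, 0x39);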
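One possible way to skip the round trip through tempWave is to load the four 16-bit samples with a single 64-bit load and convert them in registers. The following is a minimal sketch using SSE2 only; `load4_int16_as_float` is a hypothetical helper, and its `scale` parameter stands in for BITS_16_MULT, whose value is not shown here.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Hypothetical helper: loads four consecutive int16 samples starting at src,
// converts them to float and multiplies each by scale.
inline __m128 load4_int16_as_float(const std::int16_t* src, float scale)
{
    // One 64-bit load (no alignment requirement) puts the four samples
    // in the low half of the register.
    __m128i packed = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));

    // Interleaving the register with itself places each sample in the upper
    // 16 bits of a 32-bit lane; the arithmetic shift then sign-extends it.
    __m128i wide = _mm_srai_epi32(_mm_unpacklo_epi16(packed, packed), 16);

    // Convert the four int32 values to float and apply the scale factor.
    return _mm_mul_ps(_mm_cvtepi32_ps(wide), _mm_set1_ps(scale));
}
```

With this, `S0 = load4_int16_as_float(xData + newPosition, BITS_16_MULT);` would replace the four scalar conversions and the aligned load. On SSE4.1, `_mm_cvtepi16_epi32` could replace the unpack and shift pair.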

Hermite interpolation and 16-bit integer streams
