Question

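I am writing a simple convolution function in C++, starting from a very basic "sliding window" convolution with plain products (no FFT for now) and working up to SSE, AVX and possibly OpenCL. I have run into a problem with SSE. My code looks like this: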

for (x = 0; x < SIZEX - KSIZEX + 1; ++x)
{
    for (y = 0; y < SIZEY - KSIZEY + 1; ++y)
    {           
        tmp = 0.0f;

        float fDPtmp = 0.0f;
        float *Kp = &K[0];


        for (xi = 0; xi < KSIZEX; ++xi, Kp=Kp+4)
        {                               
            float *Cp = &C[(x+xi)*SIZEY + y];

            __m128 *KpSSE = reinterpret_cast<__m128*>(&K);
            __m128 *CpSSE = reinterpret_cast<__m128*>(&C[(x + xi)*SIZEY + y]);
            __m128 DPtmp = _mm_dp_ps(*KpSSE, *CpSSE, 0xFF);
            _mm_store_ss(&fDPtmp, DPtmp);

            tmp += fDPtmp;
        }

        R[k] = tmp;
        ++k;
    }
}

The required matrices are initialized like this (their sizes are assumed to be fine, since the simpler implementations work correctly):

__declspec(align(16)) float *C = ReadMatrix("E:\\Code\\conv\\C.bin");
__declspec(align(16)) float *K = ReadMatrix("E:\\Code\\conv\\K.bin");
__declspec(align(16)) float *R = new float[CSIZEX*CSIZEY];

The code crashes at y = 1, so I suspect there is a mistake in how I handle the pointers. Interestingly, if I replace the reinterpret_casts with _mm_set_ps, i.e.

__m128 KpSSE = _mm_set_ps(Kp[0], Kp[1], Kp[2], Kp[3]);
__m128 CpSSE = _mm_set_ps(Cp[0], Cp[1], Cp[2], Cp[3]);
__m128 DPtmp = _mm_dp_ps(KpSSE, CpSSE, 0xFF);
_mm_store_ss(&fDPtmp, DPtmp);

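the whole thing works, albeit slowly, which I blame on all the copy operations.

Can somebody please point out what exactly I am doing wrong?

Thank you very much

Pat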

UPDATE: Okay, as Paul pointed out, the problem lies in ReadMatrix (or, as an alternative solution, I could use _mm_loadu_ps). As for ReadMatrix(), it looks like this:

__declspec(align(16)) float* ReadMatrix(string path)
{
    streampos size;

    ifstream file(path, ios::in | ios::binary | ios::ate);

    if (file.is_open())
    {
        size = file.tellg();
        __declspec(align(16)) float *C = new float[size];
        file.seekg(0, ios::beg);
        file.read(reinterpret_cast<char*>(&C[0]), size);
        file.close();

        return C;
    }
    else cout << "Unable to open file" << endl;
}

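It does not do the trick. Is there another way of doing this elegantly, rather than being forced to read the file piece by piece and memcpy it, which I guess should work?!

UPDATE

It still does not seem to want to work after this change: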

__declspec(align(16)) float* ReadMatrix(string path)
{
    streampos size;

    ifstream file(path, ios::in | ios::binary | ios::ate);

    if (file.is_open())
    {
        size = file.tellg();
        __declspec(align(16)) float *C = static_cast<__declspec(align(16)) float*>(_aligned_malloc(size * sizeof(*C), 16));
        file.seekg(0, ios::beg);
        file.read(reinterpret_cast<char*>(&C[0]), size);
        file.close();

        return C;
    }
    else cout << "Unable to open file" << endl;
}

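I added the static_cast there because it seemed necessary to get Paul's code to compile (i.e. _aligned_malloc returns a void pointer). I am getting close to just reading chunks of the file with fread and memcpy-ing them into an aligned array. :/ Once again I find myself asking for advice. Thank you all very much.

Pat

PS: The non-SSE code works fine with these data structures. _mm_loadu_ps is slower than the non-SSE code.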

Answer 1

This does not do what you think it does:

__declspec(align(16)) float *C = ReadMatrix("E:\\Code\\conv\\C.bin");

All the alignment directive achieves here is to align the pointer itself (i.e. C) to a 16-byte boundary, not the data it points to.
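The distinction between the pointer variable and the data it points to can be checked directly. The following is a minimal sketch, not code from the question: it uses the standard `alignas(16)` as a stand-in for `__declspec(align(16))`, and `IsAligned16`, `AlignmentDemo` and `RunAlignmentDemo` are hypothetical names introduced only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: true if the address is a multiple of 16 bytes.
inline bool IsAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

struct AlignmentDemo {
    bool pointer_variable_aligned;  // alignment of the pointer object itself
    bool pointed_to_data_aligned;   // alignment of the address it holds
};

inline AlignmentDemo RunAlignmentDemo() {
    alignas(16) float storage[8] = {};
    // The alignment specifier applies to the variable `ptr`, not to its target.
    // storage + 1 lies 4 bytes past a 16-byte boundary, so the target is misaligned.
    alignas(16) float* ptr = storage + 1;
    return { IsAligned16(&ptr), IsAligned16(ptr) };
}
```

Here the pointer variable sits on a 16-byte boundary while the address stored in it does not, which is the situation the directive in the question cannot rule out.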

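You need to either fix ReadMatrix so that it returns suitably aligned data, or use _mm_loadu_ps, as others have already suggested.

Don't use _mm_set_ps, as it tends to generate a lot of instructions, unlike _mm_loadu_ps, which maps to a single instruction.

UPDATE

You have repeated the same mistake in ReadMatrix: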
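As a rough sketch of the _mm_loadu_ps route, the four-element product and sum from the question's inner loop could look like the code below. Two things are assumptions of this sketch rather than part of the original code: the helper names `HorizontalSum` and `Dot4Unaligned` are made up for illustration, and the SSE4.1 `_mm_dp_ps` is swapped for a multiply followed by a shuffle-based horizontal add so that only baseline SSE is required. Note also that in the question's loop the address `&C[(x + xi)*SIZEY + y]` advances by one float (4 bytes) per step of y, so it cannot be 16-byte aligned for every y even when the buffer itself is aligned, which would be consistent with the crash at y = 1.

```cpp
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, ...

// Sum of the four lanes of v, using only SSE1 instructions.
inline float HorizontalSum(__m128 v) {
    __m128 hi    = _mm_movehl_ps(v, v);       // lanes [v2, v3, v2, v3]
    __m128 sum2  = _mm_add_ps(v, hi);         // lane0 = v0+v2, lane1 = v1+v3
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));  // (v0+v2) + (v1+v3)
}

// Dot product of four consecutive floats.
// Neither pointer has to be 16-byte aligned because of _mm_loadu_ps.
inline float Dot4Unaligned(const float* k, const float* c) {
    __m128 kv = _mm_loadu_ps(k);
    __m128 cv = _mm_loadu_ps(c);
    return HorizontalSum(_mm_mul_ps(kv, cv));
}
```

In the question's loop this would be called as `tmp += Dot4Unaligned(Kp, Cp);`, assuming the kernel row is four floats wide as the `Kp=Kp+4` stride suggests.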

__declspec(align(16)) float *C = new float[size];

Again, this does not guarantee the alignment of the data, only of the pointer C itself. To fix this allocation you can use _mm_malloc or _aligned_malloc:

float *C = _mm_malloc(size * sizeof(*C), 16);

or

float *C = _aligned_malloc(size * sizeof(*C), 16);

Answer 2

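In ReadMatrix, you have no guarantee that the new expression returns a properly aligned pointer. It does not matter that you assign the result to an aligned pointer (and I am not even sure whether your syntax means that the pointer itself is aligned, or the object it points to).

You need to use _mm_align or _mm_malloc or some other aligned allocation facility.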

Answer 3

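You can't use reinterpret_cast here, and I understand that _mm_loadu_ps is slow. But there is another way. Unroll your loop, read in aligned data, and shift and mask in the new values before operating on them. This will be fast and correct. That is, you could do something like this in your inner loop: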

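// p is a float* that is 16-byte aligned at this point
__m128i x = _mm_castps_si128(_mm_load_ps(p));     // p[0..3]
__m128i y = _mm_castps_si128(_mm_load_ps(p + 4)); // p[4..7]; + 4 floats, not + sizeof(float)
__m128i w, z;                                     // w holds the shifted window

// do your operation on x 1st time this iteration here (window p[0..3])

z = _mm_slli_si128(y, sizeof(float) * 3);  // keep p[4] in the top lane
w = _mm_srli_si128(x, sizeof(float));      // drop p[0]
w = _mm_or_si128(w, z);                    // window p[1..4]

// do your operation on w 2nd time this iteration here

z = _mm_slli_si128(y, sizeof(float) * 2);
w = _mm_srli_si128(x, sizeof(float) * 2);  // shift the original x, not the previous window
w = _mm_or_si128(w, z);                    // window p[2..5]

// do your operation on w 3rd time this iteration here

z = _mm_slli_si128(y, sizeof(float));
w = _mm_srli_si128(x, sizeof(float) * 3);
w = _mm_or_si128(w, z);                    // window p[3..6]

// do your operation on w 4th time this iteration here

x = y; // don't need to read in x next iteration, only y

loopCounter += 4 * sizeof(float); // advance by four floats (16 bytes)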
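The shift-and-combine step above can be wrapped in a small helper so that each window is always built from the two original aligned loads. This is only a sketch under stated assumptions: `ShiftedWindow` is a hypothetical name, the shift amount is a template parameter because the byte-shift intrinsics require a compile-time constant, and only SSE2 intrinsics are used.

```cpp
#include <emmintrin.h>  // SSE2: _mm_srli_si128, _mm_slli_si128, _mm_or_si128

// Given lo = p[0..3] and hi = p[4..7] (both loaded with _mm_load_ps from
// aligned addresses), returns the four floats p[K..K+3] for K = 1, 2 or 3.
template <int K>
inline __m128 ShiftedWindow(__m128 lo, __m128 hi) {
    static_assert(K >= 1 && K <= 3, "K must be 1, 2 or 3");
    // Drop the first K floats of lo, moving the rest to the low lanes.
    const __m128i low_part  = _mm_srli_si128(_mm_castps_si128(lo), 4 * K);
    // Move the first K floats of hi into the top K lanes.
    const __m128i high_part = _mm_slli_si128(_mm_castps_si128(hi), 16 - 4 * K);
    return _mm_castsi128_ps(_mm_or_si128(low_part, high_part));
}
```

For aligned data holding 0, 1, ..., 7, `ShiftedWindow<1>` yields the lanes 1, 2, 3, 4, which matches what `_mm_loadu_ps(p + 1)` would load, without issuing an unaligned load.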

SSE: reinterpret_cast<__m128*> instead of _mm_load_ps
