Question

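I am developing a CPU-FPGA co-processing framework, so I need complete control over the alignment of my data. I have a data structure that needs only 5 bytes: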

typedef struct __attribute__ ((packed))  {
    uint32_t dst;
    uint8_t weight;
} edg_t;

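My FPGA interface can read one cache line (64 bytes) per cycle (200 million reads per second). It is crucial for my performance to pack as many elements as possible into one cache line, so padding the struct is not an option.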

5 bytes: 12 elements/read
8 bytes: 8 elements/read (padded)
Padding -> performance drops by a factor of 1.5
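
The arithmetic behind these numbers can be checked at compile time. The following is a minimal sketch that assumes only the figures stated above (64-byte cache lines, 200 million reads per second); the names kCachelineBytes, kReadsPerSecond, elements_per_line, wasted_bytes and elements_per_second are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Figures taken from the question; both are assumptions about the target platform.
constexpr std::size_t kCachelineBytes = 64;
constexpr std::uint64_t kReadsPerSecond = 200'000'000;

// Whole elements that fit in one cache line (integer division drops the remainder).
constexpr std::size_t elements_per_line(std::size_t element_bytes) {
    return kCachelineBytes / element_bytes;
}

// Bytes left unused at the end of each cache line.
constexpr std::size_t wasted_bytes(std::size_t element_bytes) {
    return kCachelineBytes % element_bytes;
}

// Elements delivered per second if every read returns one full cache line.
constexpr std::uint64_t elements_per_second(std::size_t element_bytes) {
    return kReadsPerSecond * elements_per_line(element_bytes);
}

static_assert(elements_per_line(5) == 12, "packed: 12 elements per read");
static_assert(elements_per_line(8) == 8, "padded: 8 elements per read");
static_assert(wasted_bytes(5) == 4, "packed layout leaves 4 bytes per line");
static_assert(elements_per_second(5) == 2'400'000'000ULL, "packed throughput");
static_assert(elements_per_second(8) == 1'600'000'000ULL, "padded throughput");
```

The ratio 2.4 billion / 1.6 billion is the factor of 1.5 quoted above, and the 4 leftover bytes per line are the ones the buffer-building loop has to skip.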

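However, I cannot let a struct straddle two cache lines, because that would require me to build logic on the FPGA that constantly shifts the data it reads.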

My current solution when building the buffer looks like this:

int num_elements = 1000;
int num_cachelines = num_elements / 12 + 1;

uint8_t* buffer = new uint8_t[num_cachelines * 64]; //note: plain new[] does not guarantee 64-byte alignment
uint8_t* buf_ptr = buffer;

for (int i = 0; i < num_elements; i++) {
    if (i != 0 && i % 12 == 0) buf_ptr += 4; //skip the last 4 bytes of each cache-line

    edg_t* edg_ptr = (edg_t*) buf_ptr;
    edg_ptr->dst = i; //example, I have random generators here
    edg_ptr->weight = i % 256;
    buf_ptr += sizeof(edg_t); //advance by one 5-byte element

}
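
The pointer walk above boils down to a fixed mapping from element index to byte offset, which the CPU side can reuse for random access. The sketch below is one way to write that mapping; element_offset, write_edge, read_dst and read_weight are hypothetical helper names, not part of any existing API, and std::memcpy is used so that the 4-byte field can sit at any byte offset.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kLineBytes = 64;                          // cache-line size from the question
constexpr std::size_t kElemBytes = 5;                           // 4-byte dst + 1-byte weight
constexpr std::size_t kElemsPerLine = kLineBytes / kElemBytes;  // 12

// Byte offset of element i: whole cache lines first, then 5-byte slots in the line.
constexpr std::size_t element_offset(std::size_t i) {
    return (i / kElemsPerLine) * kLineBytes + (i % kElemsPerLine) * kElemBytes;
}

// Hypothetical helper: store one edge at flat index i.
inline void write_edge(std::uint8_t* buffer, std::size_t i,
                       std::uint32_t dst, std::uint8_t weight) {
    std::uint8_t* p = buffer + element_offset(i);
    std::memcpy(p, &dst, sizeof dst);  // byte-wise copy, safe for unaligned targets
    p[sizeof dst] = weight;
}

// Hypothetical helper: load the dst field of the edge at flat index i.
inline std::uint32_t read_dst(const std::uint8_t* buffer, std::size_t i) {
    std::uint32_t dst;
    std::memcpy(&dst, buffer + element_offset(i), sizeof dst);
    return dst;
}

// Hypothetical helper: load the weight field of the edge at flat index i.
inline std::uint8_t read_weight(const std::uint8_t* buffer, std::size_t i) {
    return buffer[element_offset(i) + sizeof(std::uint32_t)];
}
```

Element 11 ends at byte 59 and element 12 starts at byte 64, so bytes 60 to 63 of every line stay unused and no element crosses a line boundary.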

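This works fine as long as the FPGA does all the work on its own, but now I want the FPGA and the CPU to cooperate. That means the CPU now has to read the buffer as well.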

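I was wondering whether there is a better way to let the compiler handle the padding automatically, or whether I have to skip the bytes manually every time, as I do in the buffer creation code above?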

Answer 1

I assume you will create this buffer structure once and then fill it over and over again for the FPGA to read (or vice versa). If so, this layout should work:

#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>

constexpr size_t cacheline_size = 64;
constexpr size_t num_elements = 1000;

struct __attribute__ ((packed)) edg_t  {
    /*volatile*/ uint32_t dst;   // volatile if the FPGA writes too
    /*volatile*/ uint8_t weight;
};

constexpr size_t elements_per_cachline = cacheline_size/sizeof(edg_t);
constexpr size_t num_cachelines = num_elements / elements_per_cachline + 1;

struct alignas(cacheline_size) cacheline_t {
    std::array<edg_t, elements_per_cachline> edg;
    inline auto begin() { return edg.begin(); }
    inline auto end() { return edg.end(); }
};

struct cacheline_collection_t {
    std::array<cacheline_t, num_cachelines> cl;
    inline void* address_for_fpga() { return this; }
    inline auto begin() { return cl.begin(); }
    inline auto end() { return cl.end(); }
};

int main() {
    cacheline_collection_t clc{};  // value-initialized so the reads below are well-defined
    std::cout << "edg_t                 : "
       << alignof(edg_t) << " " << sizeof(clc.cl[0].edg[0]) << "\n";
    std::cout << "cacheline_t           : "
       << alignof(cacheline_t) << " " << sizeof(clc.cl[0]) << "\n";
    std::cout << "cacheline_collection_t: "
       << alignof(cacheline_collection_t) << " " << sizeof(clc) << "\n";

    // access
    for(auto& cl : clc) {
        for(auto& edg : cl) {
            std::cout << edg.dst << " " << (unsigned)edg.weight << "\n";
        }
    }
}
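
The layout guarantees this relies on can be pinned down with static_assert, and a flat element index can be mapped onto the nested arrays. The sketch below repeats the type definitions from the listing above so that it compiles on its own; edge_at is a hypothetical helper added for illustration, and the packed attribute is a GCC/Clang extension, as in the original.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t cacheline_size = 64;

struct __attribute__((packed)) edg_t {
    std::uint32_t dst;
    std::uint8_t weight;
};

constexpr std::size_t elements_per_cachline = cacheline_size / sizeof(edg_t);

struct alignas(cacheline_size) cacheline_t {
    std::array<edg_t, elements_per_cachline> edg;
};

// Layout the FPGA side depends on: 12 packed elements, padded out to one full line.
static_assert(sizeof(edg_t) == 5, "edg_t must stay packed");
static_assert(elements_per_cachline == 12, "12 elements per cache line");
static_assert(sizeof(cacheline_t) == cacheline_size, "one struct per cache line");
static_assert(alignof(cacheline_t) == cacheline_size, "cache-line aligned");

// Hypothetical helper: flat index -> element, skipping the 4 padding bytes per line.
template <std::size_t N>
edg_t& edge_at(std::array<cacheline_t, N>& lines, std::size_t i) {
    return lines[i / elements_per_cachline].edg[i % elements_per_cachline];
}
```

With this, edge_at(clc.cl, i).dst = value; writes element i without any manual byte skipping. If the collection is allocated with new rather than on the stack, C++17 or later is needed for the 64-byte alignment to be honored automatically.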

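The assembly @ godbolt looks good. The inner loop is fully unrolled into 12 blocks of code, with the offset from rax increasing by 5 for each block. Moving on to the next cache line then takes 3 instructions (one of them a conditional jump):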

    add     rax, 64
    cmp     rax, rcx
    jne     .LBB0_1
