Question

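I am writing a parser for binary files. The data is stored in consecutive 32-bit records. Each file only has to be read once, after which it is fed into the analysis algorithm.

Right now I read the file in chunks of 1024 records to avoid the overhead of calling fread more often than necessary. In the example below I use oflcorrection, timetag and channel as outputs for the algorithms, and the bool return value to check whether the algorithm should stop. Also note that not all records contain photons, only those with a non-negative channel value (the code tests *channel >= 0).

With this approach I can process up to 0.5 GB/s, or 1.5 GB/s if I use the threaded version of my algorithms, which breaks the file into pieces. I know that my SSD can read at least 40% faster. I was thinking of using SIMD to parse several records in parallel, but I don't know how to do that with the conditional return clauses.

Do you know of any other approach that would let me combine chunked reading and SIMD? Is there a better way to do this in general?

Thanks.

P.S. The records correspond either to photons arriving at the detectors after passing through a beam splitter, or to special records that indicate an overflow condition. The latter are needed because the timetags are stored in a uint64_t with picosecond resolution.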

 static inline bool next_photon(FILE* filehandle, uint64_t * RecNum,
                               uint64_t StopRecord, record_buf_t *buffer,
                               uint64_t *oflcorrection, uint64_t *timetag, int *channel)
{
    pop_record:
    while (__builtin_unpredictable(buffer->head < RECORD_CHUNK)) { // still have records on buffer
        ParseHHT2_HH2(buffer->records[buffer->head], channel, timetag, oflcorrection);
        buffer->head++;
        (*RecNum)++;

        if (*RecNum >= StopRecord) { // run out of records
            return false;
        }

        if (*channel >= 0) { // found a photon
            return true;
        }
    }
    // run out of buffer
    buffer->head = 0;
    fread(buffer->records, RECORD_CHUNK, sizeof(uint32_t), filehandle);
    goto pop_record;
}

Please find the parsing function below. Keep in mind that I cannot do anything about the file format. Thanks again, Guillem.

static inline void ParseHHT2_HH2(uint32_t record, int *channel,
                                 uint64_t *timetag, uint64_t *oflcorrection)
{
    const uint64_t T2WRAPAROUND_V2 = 33554432;
    union{
        uint32_t   allbits;
        struct{ unsigned timetag  :25;
            unsigned channel  :6;
            unsigned special  :1;
        } bits;
    } T2Rec;

    T2Rec.allbits = record;

    if(T2Rec.bits.special) {
        if(T2Rec.bits.channel==0x3F) {  //an overflow record
            if(T2Rec.bits.timetag!=0) {
                *oflcorrection += T2WRAPAROUND_V2 * T2Rec.bits.timetag;
            }
            else {  // if it is zero it is an old style single overflow
                *oflcorrection += T2WRAPAROUND_V2;  //should never happen with new Firmware!
            }
            *channel = -1;
        } else if(T2Rec.bits.channel == 0) {  //sync
            *channel = 0;
        } else if(T2Rec.bits.channel<=15) {  //markers
            *channel = -2;
        }
    } else {//regular input channel
        *channel = T2Rec.bits.channel + 1;
    }
    *timetag = *oflcorrection + T2Rec.bits.timetag;
}

I came up with an almost branchless parsing function, but it does not produce any speedup.

if(T2Rec.bits.channel==0x3F) {  //an overflow record
        *oflcorrection += T2WRAPAROUND_V2 * T2Rec.bits.timetag;
    }
    *channel = (!T2Rec.bits.special) * (T2Rec.bits.channel + 1) - T2Rec.bits.special * T2Rec.bits.channel;
    *timetag = *oflcorrection + T2Rec.bits.timetag;
}

Answer 1

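I/O very likely dominates the runtime of your function. So the first thing to do is measure the speed without parsing, i.e. with only the fread calls. It may well turn out not to differ much from the speed with parsing included.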
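A minimal sketch of such a baseline measurement, assuming a hypothetical helper name fread_only and the C11 timespec_get clock (neither comes from the question). It reads the file in chunks and throws the data away, so the resulting rate is roughly an upper bound for any parser built on top of fread. Keep in mind that a second run over the same file will mostly measure the OS page cache rather than the SSD.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Reads the whole file in chunks of chunk_records 32-bit records without
 * parsing anything. Returns the number of bytes read and stores the elapsed
 * wall-clock time in *seconds. Returns 0 if the file cannot be opened or
 * the buffer cannot be allocated. (Hypothetical helper, not from the post.) */
uint64_t fread_only(const char *path, size_t chunk_records, double *seconds)
{
    assert(chunk_records > 0 && seconds != NULL);

    FILE *fh = fopen(path, "rb");
    if (!fh)
        return 0;

    uint32_t *buf = malloc(chunk_records * sizeof *buf);
    if (!buf) {
        fclose(fh);
        return 0;
    }

    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);

    uint64_t bytes = 0;
    size_t got;
    /* fread(ptr, size, nmemb, stream) returns the number of items read */
    while ((got = fread(buf, sizeof *buf, chunk_records, fh)) > 0)
        bytes += (uint64_t)got * sizeof *buf;

    timespec_get(&t1, TIME_UTC);
    *seconds = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

    free(buf);
    fclose(fh);
    return bytes;
}
```

Dividing the returned byte count by the elapsed seconds gives the raw read rate to compare against the 0.5 GB/s figure from the question.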

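If so, concentrate on optimizing that bottleneck first. Have a look at the Linux tool fio, in particular at trying different values of --ioengine= (including libaio). If you are using NVMe disks, have a look at Intel SPDK.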

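Beyond that, you can optimize the parsing further. You can avoid the (*RecNum)++ and, more importantly, the first if clause in the loop: after the fread you know how many records you are going to read, so you can use that information.

Also, I would not iterate with buffer->head; I would use a local variable in a for loop instead.

I would also use a local variable for *RecNum and assign to *RecNum only at the end. If your goal was to write to *RecNum from parallel threads, your code is buggy anyway, because neither your increment nor your read uses atomic operations.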

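Only once that is done should you start thinking about SSE or AVX. If most of the *channel values are zero, you can use SSE/AVX to check 16 or more bytes at once for being greater than or equal to zero.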
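As a rough illustration of checking several records at once, here is a sketch that tests the special flag of four records per step with SSE2 and falls back to a scalar loop elsewhere. It assumes the special flag sits in bit 31 of the raw record, which matches the bitfield layout in the question on typical little-endian GCC/Clang targets but is not guaranteed by the C standard. The function name first_regular_record is hypothetical. It is not a drop-in replacement for next_photon: overflow and sync records in the skipped range still have to be handled in order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns the index of the first record whose special bit (assumed to be
 * bit 31) is clear, or n if every record is special. */
size_t first_regular_record(const uint32_t *rec, size_t n)
{
    assert(rec != NULL || n == 0);
    size_t i = 0;

#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(rec + i));
        /* movemask_ps gathers bit 31 of each 32-bit lane into a 4-bit mask */
        int special = _mm_movemask_ps(_mm_castsi128_ps(v));
        if (special != 0xF) {
            /* at least one lane is a regular record: find the lowest one */
            int clear = ~special & 0xF;
            size_t lane = 0;
            while (!(clear & 1)) {
                clear >>= 1;
                lane++;
            }
            return i + lane;
        }
    }
#endif

    /* scalar tail, and the full loop when SSE2 is not available */
    for (; i < n; i++) {
        if ((rec[i] >> 31) == 0)
            return i;
    }
    return n;
}
```

Whether this pays off depends on how long the runs of special records are; if nearly every record is a regular photon, the vector test finds a hit in the first group each time and saves nothing.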

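Update:
Now that the code of the parsing function has been provided, I can see that the situation is different. There are a lot of branches in there...

Update:
Below is an implementation of the next_photon optimizations I described. It could be simplified if buffer->head == 0 were guaranteed on entry to next_photon. I assume you deliberately did not check the return value of fread because you only want to handle that through StopRecord, so I left it that way, even though it is not safe.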

static inline bool next_photon(FILE* filehandle, uint64_t *RecNum,
                            uint64_t StopRecord, record_buf_t *buffer,
                            uint64_t *oflcorrection, uint64_t *timetag,
                            int *channel)
{
    uint64_t recNum = *RecNum; // keep the full 64-bit counter; int would truncate it
    int i = buffer->head;

    while (true) {
        int records;
        bool quit;

        if (StopRecord - recNum <= RECORD_CHUNK - i) {
            records = i + StopRecord - recNum;
            quit = true;
        } else {
            records = RECORD_CHUNK;
            quit = false;
        }

        const int i0 = i;

        for (; i < records; i++) { // still have records on buffer
            ParseHHT2_HH2(buffer->records[i], channel, timetag, oflcorrection);

            if (*channel >= 0) { // found a photon
                *RecNum = recNum + i - i0 + 1;
                buffer->head = i + 1;
                return true;
            }
        }

        recNum += records - i0;

        if (quit) {
            break;
        }

        // run out of buffer
        i = 0;
        fread(buffer->records, RECORD_CHUNK, sizeof(uint32_t), filehandle);    
    }

    *RecNum = recNum;
    buffer->head = i;
    return false; // reached StopRecord without finding another photon
}

Answer 2

You are hitting the disk in a loop; I don't think SIMD will help much there. You could use mmap instead.

Check these answers:

When should I use mmap for file access?

Fastest file reading in C

But you can also use SIMD (SSE / AVX / NEON) for other parts, such as the parsing code.
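The mmap suggestion above could look roughly like the sketch below on a POSIX system. The type record_map_t and the functions map_records and unmap_records are hypothetical names chosen for illustration. The whole file is mapped read-only, so the parser can index the records directly instead of refilling a buffer with fread.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical view of a memory-mapped record file. */
typedef struct {
    const uint32_t *records; /* start of the mapping */
    size_t count;            /* number of complete 32-bit records */
    size_t bytes;            /* mapped length, needed for munmap */
} record_map_t;

/* Maps the file read-only. Returns 0 on success, -1 on failure
 * (including an empty file, which mmap cannot map). */
int map_records(const char *path, record_map_t *out)
{
    assert(path != NULL && out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    out->records = p;
    out->bytes = (size_t)st.st_size;
    out->count = out->bytes / sizeof(uint32_t);
    return 0;
}

/* Releases a mapping created by map_records. */
void unmap_records(record_map_t *m)
{
    assert(m != NULL);
    munmap((void *)m->records, m->bytes);
}
```

With this, the parse loop becomes a plain for loop over m.records[0 .. m.count - 1]. Whether it beats chunked fread depends on the system, so it is worth measuring both.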

Answer 3

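The fact that speeding up the data analysis through parallelization has such a large effect on your program's throughput indicates that the cost of the analysis is of the same order of magnitude as the cost of the I/O. So if you want to push throughput close to the limit imposed by the available I/O bandwidth, your best bet is probably to perform the analysis and the I/O in parallel.

You could do that by maintaining two separate I/O buffers, processing one while the other is being read into, and then flipping them.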
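A minimal sketch of that two-buffer scheme with POSIX threads, under several assumptions: the names read_job_t, read_chunk, analyse and process_double_buffered are hypothetical, analyse is only a placeholder for the real parser (it counts records whose top bit is clear), and a helper thread is started per chunk purely to keep the example short. A real implementation would more likely keep one long-lived reader thread and use much larger chunks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNK_RECORDS 1024 /* mirrors the 1024-record chunks in the question */

/* One pending read: fill dst from fh and report how many records arrived. */
typedef struct {
    FILE *fh;
    uint32_t *dst;
    size_t got;
} read_job_t;

static void *read_chunk(void *arg)
{
    read_job_t *job = arg;
    job->got = fread(job->dst, sizeof(uint32_t), CHUNK_RECORDS, job->fh);
    return NULL;
}

/* Placeholder analysis: counts records whose top bit is clear. */
static uint64_t analyse(const uint32_t *rec, size_t n)
{
    uint64_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (rec[i] >> 31) == 0;
    return hits;
}

/* Processes the stream with two buffers: while one buffer is analysed,
 * the next chunk is read into the other, then the roles are swapped. */
uint64_t process_double_buffered(FILE *fh)
{
    assert(fh != NULL);

    uint32_t buf[2][CHUNK_RECORDS];
    uint64_t total = 0;
    int cur = 0;
    size_t have = fread(buf[cur], sizeof(uint32_t), CHUNK_RECORDS, fh);

    while (have > 0) {
        read_job_t job = { fh, buf[1 - cur], 0 };
        pthread_t tid;
        int threaded = pthread_create(&tid, NULL, read_chunk, &job) == 0;

        total += analyse(buf[cur], have); /* overlaps with the read */

        if (threaded)
            pthread_join(tid, NULL);
        else
            read_chunk(&job); /* fall back to a synchronous read */

        have = job.got;
        cur = 1 - cur; /* flip the buffers */
    }
    return total;
}
```

Only the reader thread touches the FILE handle and the idle buffer while the main thread analyses the other one, so no locking is needed beyond the join.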

Fast binary parser algorithm
