Parsing a sequence of protobuf messages from consecutive chunks of a fixed-size byte buffer

Time: 2015-03-20 02:38:43

Tags: c++ protocol-buffers protobuf-c

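My C++ knowledge is limited, and I have been struggling with this for two days straight. What I need to do is use the protobuf C++ API to parse a sequence of messages from one large file, which may contain millions of such messages. Reading directly from the file is easy, because I can always call ReadVarint32 to get the size and then call ParseFromCodedStream with a limit pushed onto the CodedInputStream, as described in this post. However, the I/O-level API I am using (libuv, in fact) requires a fixed-size buffer to be allocated for every read callback. Obviously, the chunk size has nothing to do with the size of the messages I am reading.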

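This makes my life hard. Every time I read from the file and fill the fixed-size buffer (say 16K), the buffer may contain hundreds of complete protobuf messages, but the last piece of the buffer may be an incomplete message. So I thought: OK, what I should do is read as many messages as I can, then extract the incomplete tail and prepend it to the next 16K buffer I read, and keep going until I reach the end of the file. I use ReadVarint32() to get the size, compare that number with the number of bytes remaining in the buffer, and keep reading if the message size is smaller.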

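There is an API called GetDirectBufferPointer, so I tried using it to record the pointer position before I even read the next message's size. However, I suspect that because of some endianness weirdness, if I simply extract the remaining bytes from where the pointer starts and attach them to the next chunk, the parse does not succeed; in fact the first few bytes (8, I think) are completely garbled.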

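Alternatively, if I call codedStream.ReadRaw() to write the residual stream into a buffer and then prepend that to the new chunk, the data is not corrupted. The problem is that this time I lose the "size" bytes, because they have already been consumed by ReadVarint32! Even if I just remember the size I read last time and call message.ParseFromCodedStream() directly in the next iteration, it ends up reading one byte too few, parts of the message are even corrupted, and the object cannot be restored.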

std::vector<char> mCheckBuffer;
std::vector<char> mResidueBuffer;
char bResidueBuffer[READ_BUFFER_SIZE];
char temp[READ_BUFFER_SIZE];
google::protobuf::uint32 size;
//"in" is the file input stream
while (in.good()) {
    in.read(mReadBuffer.data(), READ_BUFFER_SIZE);
    mCheckBuffer.clear();
    //merge the last remaining chunk that contains incomplete message with
    //the new data chunk I got out from buffer. Excuse my terrible C++ foo
    std::merge(mResidueBuffer.begin(), mResidueBuffer.end(),  
    mReadBuffer.begin(), mReadBuffer.end(), std::back_inserter(mCheckBuffer));

    //Treat the new merged buffer array as the new CIS
    google::protobuf::io::ArrayInputStream ais(&mCheckBuffer[0], 
    mCheckBuffer.size());
    google::protobuf::io::CodedInputStream cis(&ais);
    //Record the pointer location on CIS in bResidueBuffer
    cis.GetDirectBufferPointer((const void**)&bResidueBuffer,
    &bResidueBufSize);

    //No size information, probably first time or last iteration  
    //coincidentally read a complete message out. Otherwise I simply 
    //skip reading size again as I've already populated that from last 
    //iteration when I got an incomplete message
    if(size == 0) {
         cis.ReadVarint32(&size);
    }
    //Have to read this again to get remaining buffer size
    cis.GetDirectBufferPointer((const void**)&temp, &mResidueBufSize);

    //Compare the next message size with how much left in the buffer, if      
    //message size is smaller, I know I can read at least one more message 
    //out, keep reading until I run out of buffer, or, it's the end of message 
    //and my buffer just allocated larger so size should be 0
    while (size <= mResidueBufSize && size != 0) {
        //If this cis I constructed didn't have the size info at the beginning, 
        //and I just read straight from it hoping to get the message out from 
        //the "size" I got from last iteration, it simply doesn't work
        //(read one less byte in fact, and some part of the message corrupted)
        //push the size constraint to the input stream;
        int limit = cis.PushLimit(size);
        //parse message from the input stream
        message.ParseFromCodedStream(&cis);  
        cis.PopLimit(limit);
        google::protobuf::TextFormat::PrintToString(message, &str);
        printf("%s", str.c_str());
        //do something with the parsed object
        //Now I have to record the new pointer location again
        cis.GetDirectBufferPointer((const void**)&bResidueBuffer, 
        &bResidueBufSize);
        //Read another time the next message's size and go back to while loop check
        cis.ReadVarint32(&size);

    }
    //If I do the next line, bResidueBuffer will have the correct CIS information 
    //copied over, but not having the "already read" size info
    cis.ReadRaw(bResidueBuffer, bResidueBufSize);
    mResidueBuffer.clear();
    //I am constructing a new vector that receives the residual chunk of the 
    //current buffer that isn't enough to restore a message
    //If I don't do ReadRaw, this copy completely messes up at least the first 8 
    //bytes of the copied buffer's value, due to I suspect endianness
    mResidueBuffer.insert(mResidueBuffer.end(), &bResidueBuffer[0], 
    &bResidueBuffer[bResidueBufSize]);
}

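I am really out of ideas now. Is it even possible to use protobuf gracefully with an API that requires fixed-size intermediate buffers? Any input is much appreciated, thanks!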

2 answers:

Answer 0: (score: 1)

I see two major problems in your code:

std::merge(mResidueBuffer.begin(), mResidueBuffer.end(),  
mReadBuffer.begin(), mReadBuffer.end(), std::back_inserter(mCheckBuffer));

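It looks like you expect std::merge to concatenate your buffers, but this function actually merges two sorted ranges into one sorted range, in the merge-sort sense. That makes no sense in this context; mCheckBuffer will end up containing garbage.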
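To make the difference concrete, here is a minimal sketch that contrasts the two operations on small sorted inputs. The helper names `concat` and `merged` are made up for illustration and do not appear in the question's code.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Concatenation: every byte of `residue`, followed by every byte of `chunk`.
// This is what the buffer-stitching step actually needs.
std::vector<char> concat(const std::vector<char>& residue,
                         const std::vector<char>& chunk) {
    std::vector<char> out;
    out.reserve(residue.size() + chunk.size());
    out.insert(out.end(), residue.begin(), residue.end());
    out.insert(out.end(), chunk.begin(), chunk.end());
    assert(out.size() == residue.size() + chunk.size());
    return out;
}

// What std::merge does instead: it interleaves the two (assumed sorted)
// ranges by value, so the original byte order is not preserved.
std::vector<char> merged(const std::vector<char>& residue,
                         const std::vector<char>& chunk) {
    std::vector<char> out;
    std::merge(residue.begin(), residue.end(),
               chunk.begin(), chunk.end(),
               std::back_inserter(out));
    return out;
}
```

With `residue = {'c', 'd'}` and `chunk = {'a', 'b'}`, `concat` keeps the stream order (c, d, a, b), while `merged` reorders the bytes by value (a, b, c, d). On real message bytes, which are not sorted, the result of std::merge is not even well defined.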

cis.GetDirectBufferPointer((const void**)&bResidueBuffer,
&bResidueBufSize);

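Here you are casting &bResidueBuffer to an incompatible pointer type. bResidueBuffer is a char array, so &bResidueBuffer is a pointer to a char array, which is not a pointer to a pointer. This is admittedly confusing, because an array can be implicitly converted to a pointer (one that points to the array's first element), but that really is a conversion: bResidueBuffer itself is not a pointer, it can merely be converted to one.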

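I think you have also misunderstood what GetDirectBufferPointer() does. It looks like you expect it to copy the rest of the buffer into bResidueBuffer, but the method never copies any data. It returns a pointer into the original buffer.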
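Both points can be checked without protobuf. The sketch below uses a hypothetical stand-in, `get_view`, that only mimics the out-parameter style of GetDirectBufferPointer (it is not the protobuf function), and `kBufSize`, `array_buf` and `view_offset` are likewise illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

constexpr std::size_t kBufSize = 16;  // hypothetical buffer size

char array_buf[kBufSize];

// The address of a char array is a pointer to an array, not a char**.
// Casting it to const void** therefore points the callee at the array's
// own bytes instead of at a pointer variable it can fill in.
static_assert(std::is_same<decltype(&array_buf), char (*)[kBufSize]>::value,
              "&array_buf is a pointer to an array of kBufSize chars");
static_assert(!std::is_same<decltype(&array_buf), char**>::value,
              "&array_buf is not a pointer to a pointer");

// Hypothetical stand-in with the same out-parameter shape: it hands back
// a pointer into existing storage and copies nothing.
inline void get_view(const char* storage, int available,
                     const void** data, int* size) {
    *data = storage;
    *size = available;
}

// Correct calling pattern: pass the address of a real pointer variable,
// then derive the offset by pointer arithmetic against the buffer start.
inline std::ptrdiff_t view_offset(const char* storage, int skip, int total) {
    assert(skip >= 0 && skip <= total);
    const void* ptr = nullptr;
    int size = 0;
    get_view(storage + skip, total - skip, &ptr, &size);
    return static_cast<const char*>(ptr) - storage;
}
```

For a 16-byte buffer, `view_offset(storage, 5, 16)` returns 5: the returned pointer is simply an address inside `storage`, and no bytes were moved anywhere.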

The correct way to call it looks like this:

const void* ptr;
int size;
cis.GetDirectBufferPointer(&ptr, &size);

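Now ptr points into the original buffer. You can compare it with a pointer to the start of the buffer to find out where you are in the stream, for example: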

size_t pos = (const char*)ptr - &mCheckBuffer[0];

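However, you shouldn't do that, because CodedInputStream already has a method, CurrentPosition(), for exactly this purpose. It returns the current byte offset in the buffer. So use that instead.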

Answer 1: (score: 0)

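OK, thanks to Kenton's help in pointing out the main problems in my question, I have now revised the code snippet and tested it. I'll post my solution here. That said, I am not happy about all the complexity and edge-case checks that are needed here; I think it is error-prone. Even so, what I will probably really do is make my direct "read from stream" blocking call in another thread, outside my libuv main thread, so that I don't need to use the libuv API for this. But for completeness, here is my code: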

std::vector<char> mCheckBuffer;
std::vector<char> mResidueBuffer;
std::vector<char> mReadBuffer(READ_BUFFER_SIZE);
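//size must start at 0 so that the first iteration reads a size prefix
google::protobuf::uint32 size = 0;
//number of unparsed bytes left in mCheckBuffer
size_t bResidueBufSize = 0;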
//"in" is the file input stream
while (in.good()) {
    //This part is tricky as you're not guaranteed that what end up in 
    //mReadBuffer is everything you read out from the file. The same 
    //happens with libuv's assigned buffer, after EOF, what's rest in 
    //the buffer could be anything
    in.read(mReadBuffer.data(), READ_BUFFER_SIZE);
    //merge the last remaining chunk that contains incomplete message with
    //the new data chunk I got out from buffer. I couldn't find a more 
    //efficient way doing that
    mCheckBuffer.clear();
    mCheckBuffer.reserve(mResidueBuffer.size() + mReadBuffer.size());
    mCheckBuffer.insert(mCheckBuffer.end(), mResidueBuffer.begin(),
    mResidueBuffer.end());
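    //only append the bytes the last read actually produced (in.gcount()),
    //not the whole fixed-size buffer
    mCheckBuffer.insert(mCheckBuffer.end(), mReadBuffer.begin(),
    mReadBuffer.begin() + in.gcount());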
    //Treat the new merged buffer array as the new CIS
    google::protobuf::io::ArrayInputStream ais(&mCheckBuffer[0], 
    mCheckBuffer.size());
    google::protobuf::io::CodedInputStream cis(&ais);
    //No size information, probably first time or last iteration  
    //coincidentally read a complete message out. Otherwise I simply 
    //skip reading size again as I've already populated that from last 
    //iteration when I got an incomplete message
    if(size == 0) {
        cis.ReadVarint32(&size);
    }
    bResidueBufSize = mCheckBuffer.size() - cis.CurrentPosition();
    //Compare the next message size with how much left in the buffer, if      
    //message size is smaller, I know I can read at least one more message 
    //out, keep reading until I run out of buffer. If, it's the end of message 
    //and size (next byte I read from stream) happens to be 0, that
    //will trip me up, cos when I push size 0 into PushLimit and then try 
    //parsing, it will actually return true even if it reads nothing. 
    //So I can get into an infinite loop, if I don't do the check here
    while (size <= bResidueBufSize && size != 0) {
        //If this cis I constructed didn't have the size info at the 
        //beginning, and I just read straight from it hoping to get the  
        //message out from the "size" I got from last iteration
        //push the size constraint to the input stream
        int limit = cis.PushLimit(size); 
        //parse the message from the input stream
        bool result = message.ParseFromCodedStream(&cis);  
        //Parse fail, it could be because last iteration already took care
        //of the last message and that size I read last time is just junk
        //I choose to only check EOF here when result is not true, (which
        //leads me to having to check for size=0 case above), cos it will
        //be too many checks if I check it everytime I finish reading a 
        //message out
        if(!result) {
            if(in.eof()) {
                log.info("Reached EOF, stop processing!");
                break;
            }
            else {
                log.error("Read error or input mal-formatted! Log error!");
                std::exit(EXIT_FAILURE); //requires <cstdlib>
            }
        }
        cis.PopLimit(limit);
        google::protobuf::TextFormat::PrintToString(message, &str);
        //Do something with the message

        //This is when the last message read out exactly reach the end of 
        //the buffer and there is no size information available on the 
        //stream any more, in which case size will need to be reset to zero
        //so that the beginning of next iteration will read size info first
        if(!cis.ReadVarint32(&size)) {
            size = 0;
        }
        bResidueBufSize = mCheckBuffer.size() - cis.CurrentPosition();
    }
    if(in.eof()) {
        break;
    }
    //Now I am copying the residual buffer into the intermediate
    //mResidueBuffer, which will be merged with newly read data in next iteration
    mResidueBuffer.clear();
    mResidueBuffer.reserve(bResidueBufSize);
    //use iterators: indexing mCheckBuffer at size() is out of range
    mResidueBuffer.insert(mResidueBuffer.end(), 
    mCheckBuffer.begin() + cis.CurrentPosition(), mCheckBuffer.end());
}
if(!in.eof()) {
    log.error("Something else other than EOF happened to the file, log error!");
    std::exit(EXIT_FAILURE); //requires <cstdlib>
}
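
For comparison, the same carry-over idea can be sketched without protobuf at all, which may make the edge cases easier to see. This is only an illustration under stated assumptions: `decode_varint32`, `frame`, `Framer` and `read_all` are hypothetical names, and the varint decoder is a hand-written version of the base-128 length prefix rather than a call into the protobuf library. The one design difference from the code above is that the unparsed tail keeps its length prefix, so no size needs to be remembered between chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Decodes a base-128 varint starting at data[pos]. Returns the number of
// bytes consumed, 0 if the prefix is cut off by the end of the data, or
// -1 if it runs past 5 bytes (malformed for a 32-bit value).
inline int decode_varint32(const std::string& data, std::size_t pos,
                           std::uint32_t* value) {
    std::uint32_t result = 0;
    for (int i = 0; i < 5; ++i) {
        if (pos + i >= data.size()) return 0;  // prefix split across chunks
        const std::uint8_t byte = static_cast<std::uint8_t>(data[pos + i]);
        result |= static_cast<std::uint32_t>(byte & 0x7F) << (7 * i);
        if ((byte & 0x80) == 0) { *value = result; return i + 1; }
    }
    return -1;
}

// Builds one length-prefixed record (used to produce sample input).
inline std::string frame(const std::string& payload) {
    std::string out;
    std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    while (n >= 0x80) {
        out.push_back(static_cast<char>((n & 0x7F) | 0x80));
        n >>= 7;
    }
    out.push_back(static_cast<char>(n));
    return out + payload;
}

class Framer {
public:
    // Appends one chunk and returns the payloads that are now complete.
    std::vector<std::string> feed(const char* chunk, std::size_t n) {
        pending_.append(chunk, n);
        std::vector<std::string> out;
        std::size_t pos = 0;
        for (;;) {
            std::uint32_t len = 0;
            const int used = decode_varint32(pending_, pos, &len);
            // A real parser would report an error instead of asserting.
            assert(used >= 0 && "malformed length prefix");
            if (used <= 0) break;                           // need more bytes
            if (pending_.size() - pos - used < len) break;  // body incomplete
            out.emplace_back(pending_, pos + used, len);
            pos += used + len;  // advances even when len == 0
        }
        pending_.erase(0, pos);  // keep the tail, length prefix included
        return out;
    }
    std::size_t pending() const { return pending_.size(); }

private:
    std::string pending_;
};

// Reads fixed-size chunks from a stream and feeds only the bytes that
// were actually read (in.gcount()) to the framer.
inline std::vector<std::string> read_all(std::istream& in,
                                         std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<char> buf(chunk_size);
    Framer framer;
    std::vector<std::string> all;
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) ||
           in.gcount() > 0) {
        auto msgs = framer.feed(buf.data(),
                                static_cast<std::size_t>(in.gcount()));
        all.insert(all.end(), msgs.begin(), msgs.end());
    }
    return all;
}
```

Feeding a `std::istringstream` built from `frame("hello") + frame("") + frame("bye")` through `read_all` with a chunk size of 1 or 7 yields the same three payloads either way. With protobuf, each returned payload could presumably be handed to the message's ParseFromArray, and in a libuv read callback only `Framer::feed` would be needed, called with the number of bytes the callback reports.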