multipart/form-data loses bytes on large files

Date: 2018-07-20 01:20:46

Tags: c++ multipartform-data fastcgi lighttpd

I am writing a multipart/form-data parser in C++, since the available options seem to be very scarce.

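My initial approach was to use istream::getline to buffer one line (or part of a line) at a time so that I can detect boundaries. This works for smaller files but not for larger ones. With large files (> 50MB), cin's bad bit is occasionally set, and after clearing the istream I notice that I have lost bytes. I have no idea why, and that is what this question is about.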

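However, if I increase the buffer size to 4MB and use istream::read to dump the whole multipart/form-data request to a file, I lose no bytes and cin's bad bit is never set. I can then reopen the dumped file as an ifstream instead of using cin, and my original small-buffer getline code works fine on it.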
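For reference, the dump step described above can be sketched roughly as follows. This is a minimal illustration rather than the code actually used: dump_stream is a hypothetical helper name, and the chunk size is a parameter instead of the fixed 4MB buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original code): copies everything
// from `in` to `out` in fixed-size chunks and returns the bytes written.
std::size_t dump_stream(std::istream& in, std::ostream& out,
                        std::size_t chunk_size = 4096)
{
    std::vector<char> buf(chunk_size);
    std::size_t total = 0;
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        // read() sets eofbit|failbit on the final short chunk, so rely on
        // gcount() to learn how many bytes actually arrived.
        const std::streamsize n = in.gcount();
        if (n <= 0)
            break;
        out.write(buf.data(), n);
        if (!out)
            break; // e.g. out of disk space
        total += static_cast<std::size_t>(n);
    }
    return total;
}
```

Because read() does not interpret delimiters, binary content (including embedded '\0' and '\n' bytes) passes through unchanged.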

Any insight into what is going on here? Could it be a side effect of FastCGI or Lighttpd?

Edit:

Here are the relevant code snippets:

#include <fcgio.h>
//...

int main()
{
    //...
    FCGX_Request request;

    FCGX_Init();
    FCGX_InitRequest(&request, 0, 0);

    const size_t LEN = 1024;
    vector<char> v(LEN); // Workaround for getting duplicates of every byte?
    while (FCGX_Accept_r(&request) == 0) {
        fcgi_streambuf cin_fcgi_streambuf(request.in, &v[0], v.size());
        //... (eventually calls _parseMultipartFormFieldFile)
    }

    //...
}

/*
    Extract a file from a multipart form section

    istream should already have boundary and headers removed up through the final "\r\n"

    Note that there are a lot of potential off-by-one errors here. Need to pay special attention
    to gcount() and what is present in the buffer in each given scenario. Hence why you see:

    gcount
    gcount-1
    gcount-2

    These offsets are due to null terminator sometimes being appended, sometimes not, and/or '\r' being present or not.

    It is possible for a few rare things to happen that will break this function:

    1. Malicious content length

    Client could lie about content length and send much more than we have room for. Should count bytes eventually, but easy enough to configure webserver to protect us.
*/
bool _parseMultipartFormFieldFile(
    Request & req,
    istream & input,
    const string & name,
    const string & upload_dir,
    const string & boundary,
    const string & end_boundary
)
{
    static unsigned int file_id = 0; //used to generate unique file names

    //Need fixed buffer size to prevent running out of RAM (malicious or not)
    char buf[4096];

    string file_name = upload_dir + ECPP_TMP_FILE + to_string(file_id++);

    ofstream f(file_name, std::ofstream::out | std::ofstream::binary);
    if (!f.is_open())
        return false;

    bool eof = false;
    while (!eof) {
        //Out of space in flash?
        if (!f.good())
            return false;

        f.flush();

        input.getline(buf, sizeof(buf));
        unsigned int gcount = input.gcount();

        if (input.bad()) {
            //Crap! If we're here, we have most likely lost a few bytes...
            input.clear();
            continue;
        }
        else if (input.eof()) {
            //If we are here, the multipart/form-data request was malformed
            f.close();
            remove(file_name.c_str()); //Delete malformed file
            return false;
        }
        else if (input.fail()) {
            //If we are in this condition, it means we encountered a line longer than our buffer
            //No delimiter was consumed in this case, so gcount() is exactly the number of bytes stored; write them all out
            f.write(buf, gcount);
            input.clear(); //clear fail flag
            continue;
        }

        if (gcount >= 2 && buf[gcount-2] == '\r') {
            string peek = peekLine(input); //uses putback - modifies gcount()
            if (peek == boundary || peek == end_boundary) {
                //If we are in here, it means we encountered the last line in the section
                //That means there is a trailing '\r' which we need to remove in addition to the null terminator
                f.write(buf, gcount-2); // Remove null terminator and \r before writing
                req.file[name] = file_name;
                eof = true;
                continue;
            }
        }

        //If we are here it means we read in the entire line.
        //Write out everything (minus the null terminator), and also add in the newline that was stripped by getline()
        f.write(buf, gcount-1);
        f.write("\n", 1);
    }

    return true;
}

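So, in short, the problem is this: if I pass cin_fcgi_streambuf to _parseMultipartFormFieldFile, bytes are lost (the bad bit is triggered). But if I indiscriminately dump cin_fcgi_streambuf to a file with char buf[4000000] + input.read(), and then pass an ifstream of that file to _parseMultipartFormFieldFile, everything works fine.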
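The gcount, gcount-1 and gcount-2 offsets mentioned in the function comment can be checked in isolation against an in-memory stream. The sketch below assumes text input (it uses strlen, which would stop early on binary data), and LineRead and read_line are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record of what a single getline() call did.
struct LineRead {
    std::streamsize gcount; // characters extracted, including a consumed '\n'
    std::size_t stored;     // characters placed in buf before the terminator
    bool truncated;         // failbit was set because the buffer filled up
};

LineRead read_line(std::istream& in, char* buf, std::streamsize size)
{
    in.getline(buf, size);
    // getline() always null-terminates buf when size > 0, so strlen is safe
    // here for text input.
    LineRead r{in.gcount(), std::strlen(buf), false};
    if (in.fail() && !in.bad() && !in.eof()) {
        r.truncated = true; // line longer than the buffer
        in.clear();         // clear failbit so reading can continue
    }
    return r;
}
```

When the delimiter is found, gcount() counts the consumed '\n' that was not stored, so the payload length is gcount-1 (and gcount-2 without a trailing '\r'). When the buffer fills first, no delimiter is consumed and gcount() equals the number of bytes stored.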

1 Answer:

Answer 0: (score: 0)

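The fact is that what input.getline returns does not contain the CRLF. So what happens if a binary file is posted? Also, your example source code cannot handle a request that contains multiple posted files, because you only open a single file stream. That is why you have to change the structure of your source code.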
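One way to avoid depending on line breaks inside binary content is to scan fixed-size chunks for the boundary delimiter and hold back the last few bytes in case the delimiter straddles two reads. The following is only a rough sketch under that assumption and is not taken from the linked library; copy_until and its carry parameter are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: copies bytes from `in` to `out` until `delim` is seen.
// `carry` holds bytes read but not yet written; after a match it contains
// whatever followed the delimiter, so the next call can continue from there.
// Returns true if the delimiter was found, false if the input ended first.
bool copy_until(std::istream& in, std::ostream& out, const std::string& delim,
                std::string& carry, std::size_t chunk_size = 4096)
{
    if (delim.empty() || chunk_size == 0)
        return false;
    std::vector<char> buf(chunk_size);
    for (;;) {
        const std::size_t pos = carry.find(delim);
        if (pos != std::string::npos) {
            out.write(carry.data(), static_cast<std::streamsize>(pos));
            carry.erase(0, pos + delim.size()); // drop data and delimiter
            return true;
        }
        // No match yet: flush everything except the last delim.size()-1
        // bytes, which could still be the start of a split delimiter.
        if (carry.size() >= delim.size()) {
            const std::size_t safe = carry.size() - (delim.size() - 1);
            out.write(carry.data(), static_cast<std::streamsize>(safe));
            carry.erase(0, safe);
        }
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize n = in.gcount();
        if (n <= 0) { // end of input or stream error
            out.write(carry.data(), static_cast<std::streamsize>(carry.size()));
            carry.clear();
            return false;
        }
        carry.append(buf.data(), static_cast<std::size_t>(n));
    }
}
```

For multipart bodies the delimiter would typically be "\r\n--" followed by the boundary string, and each part would be written to its own output stream by calling the function again with the same carry.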

You can upload data or files of unlimited size. Try this solution:

const char* ctype = "multipart/form-data; boundary=----WebKitFormBoundaryfm9qwXVLSbFKKR88";
size_t content_length = 1459606;
http_payload* hp = new http_payload(ctype, content_length);
if (hp->is_multipart()) {
    int ret = hp->read_all("C:\\temp\\");
    if (ret < 0) {
        std::cout << hp->get_last_error() << std::endl;
        hp->clear();
    }
    else {
        std::string dir_str("C:\\upload_dir\\");
        ret = hp->read_files([&dir_str](http_posted_file* file) {
            std::string path(dir_str.c_str());
            path.append(file->get_file_name());
            file->save_as(path.c_str());
            file->clear(); path.clear();
            std::string().swap(path);
        });
        hp->clear();
        std::cout << "Total file uploaded :" << ret << std::endl;
    }
}
else {
    int ret = hp->read_all();
    if (ret < 0) {
        std::cout << hp->get_last_error() << std::endl;
        hp->clear();
    }
    else {
        std::cout << "Posted data :" << hp->get_body() << std::endl;
        hp->clear();

    }
}

https://github.com/safeonlineworld/web_jsx/blob/0d08773c95f4ae8a9799dbd29e0a4cd84413d108/src/web_jsx/core/http_payload.cpp#L402