Question

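I have to load large data files (several GB each), and I want to load them into a 2D vector. The code below does the job, but it is extremely slow. More specifically, the goal is to get all rows whose second-column value equals index(_lh, _sh), and then to exclude the rows whose 4th-column value is the same as in row +1 and row -1. I am new to C++; I usually code in Python (and already have working Python code for this problem). But I need this to be as fast as possible, so I tried rewriting my Python code in C++. It is now slower than the Python version (and so far only loading the data into the vector is implemented), so before I continue I want to improve it. From what I found in similar questions, the problems are the dynamic vectors, .push_back(), and getline().

I am confused by the mapping and chunked loading mentioned in similar questions, so I have not been able to change the code accordingly.

Could you help me optimize this code?

Thanks.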

#include <iostream>
#include <sstream>
#include <fstream>
#include <array>
#include <string>
#include <vector>

using namespace std;

int pixel(int radek, int sloupec, int rozmer = 256) {
    int index = (radek - 1) * rozmer + sloupec;
    int index_lh = (index - rozmer - 1);
    int index_sh = (index - rozmer);
    int index_ph = (index - rozmer + 1);
    int index_l = (index - 1);
    int index_p = (index + 1);
    int index_ld = (index + rozmer - 1);
    int index_sd = (index + rozmer);
    int index_pd = (index + rozmer + 1);
    array<int, 9> index_all = { {index, index_lh, index_sh, index_ph, index_l, index_p, index_ld, index_sd, index_pd } };
    vector<vector<string>> Data;
    vector<string> Line;
    string line;

    for (int m = 2; m < 3; m++) {
        string url = ("e:/TPX3 - kalibrace - 170420/ToT_ToA_calib_Zn_" + to_string(m) + string(".t3pa"));
        cout << url << endl;
        ifstream infile(url);
        if (!infile)
        {
            cout << "Error opening output file" << endl;
            system("pause");
            return -1;
        }
        while (getline(infile, line))
        {
            Line.push_back(line);
            istringstream txtStream(line);
            string txtElement;
            vector<string> Element;
            while (getline(txtStream, txtElement, '\t')){
                Element.push_back(txtElement);
            }
            Data.push_back(Element);
        }
    }
    cout << Data[1][0] << ' ' << Data[1][1] << ' ' << Data[1][2] << endl;
    return 0; 
}

int main()
{   
    int x = pixel(120, 120);
    cout << x << endl;
    system("pause");
    return 0;
}

Answer 1

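Vectors become slow if their underlying buffer is reallocated often. A vector has to be implemented on top of a contiguous memory buffer, and every time the buffer's limit is exceeded, a new, larger buffer must be allocated and the contents transferred from the old buffer to the new one. If you know roughly how big a buffer you will need (it does not have to be exact), you can help the program allocate an appropriately sized buffer up front by calling, for example, Data.reserve(n) (where n is approximately the number of elements you think you will need). This does not change the "size" of the vector, only the size of the underlying buffer. As a closing remark, I have to say that I have not actually benchmarked this, so it may or may not improve your program's performance.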
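To make the difference between size and capacity concrete, here is a minimal sketch. The helper name make_table and the row estimate passed to it are assumptions for illustration, not part of the original code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: creates the outer table and pre-allocates room for
// `expected_rows` rows. No rows are created; only the buffer is sized.
std::vector<std::vector<std::string>> make_table(std::size_t expected_rows) {
    std::vector<std::vector<std::string>> data;
    data.reserve(expected_rows);  // capacity() grows, size() stays 0
    return data;
}
```

As long as no more than expected_rows rows are appended, the outer vector never has to reallocate, so the address returned by data() stays the same. If the estimate turns out too small, the vector still grows on its own; the reserve call is only a hint that saves the early reallocations.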

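Edit: That said, I think performance is more likely bottlenecked by the line Data.push_back(Element);, which makes a copy of the Element vector. If you are using C++11, I believe this can be avoided by doing something like Data.emplace_back(std::move(Element));, in which case you must not rely on the contents of Element afterwards (they have been moved out). You also need to include <utility> for std::move.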
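A small sketch of what moving buys here, assuming the default allocator. The alias Row and the helper append_by_move are made-up names for illustration; the helper reports whether the row's heap buffer was handed over rather than copied.

```cpp
#include <string>
#include <utility>
#include <vector>

using Row = std::vector<std::string>;  // illustrative alias, not in the original code

// Appends `row` to `data` by moving it. Returns true if the strings now stored
// in `data` live in the very buffer that `row` owned before the call.
bool append_by_move(std::vector<Row>& data, Row& row) {
    const std::string* buffer_before = row.data();
    data.push_back(std::move(row));  // steals row's buffer instead of copying each string
    return data.back().data() == buffer_before;
}
```

With a plain push_back(row), every string in the row would be copied into a freshly allocated buffer. After the move, row should be treated as empty and refilled before it is used again.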

Answer 2

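You could try the old C file-reading API (FILE*, fopen(), etc.), or set a bigger buffer for the std::istringstream, as follows (note that the effect of pubsetbuf() on a string stream is implementation-defined)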

constexpr std::size_t dimBuff { 10240 }; // 10K, for example
char myBuff[dimBuff];

// ...

istringstream txtStream(line);
txtStream.rdbuf()->pubsetbuf(myBuff, dimBuff);
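The first suggestion, the C file API, could look roughly like the sketch below. The names Row, split_tabs, and load_with_stdio are invented for illustration, and the code assumes that no line is longer than the fixed buffer. Unlike the getline-based loop in the question, this splitter keeps a trailing empty field and yields one empty field for an empty line.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

using Row = std::vector<std::string>;  // illustrative alias

// Hypothetical helper: splits `len` characters of `line` on '\t'
// without constructing an istringstream.
Row split_tabs(const char* line, std::size_t len) {
    Row row;
    const char* end = line + len;
    const char* field = line;
    for (const char* p = line; p != end; ++p) {
        if (*p == '\t') {
            row.emplace_back(field, p);  // characters in [field, p)
            field = p + 1;
        }
    }
    row.emplace_back(field, end);  // last field (may be empty)
    return row;
}

// Reads a tab-separated file with C stdio and appends one Row per line.
// Returns false if the file cannot be opened.
bool load_with_stdio(const char* path, std::vector<Row>& data) {
    std::FILE* f = std::fopen(path, "r");
    if (!f) return false;
    char buf[4096];  // assumption: every line fits into this buffer
    while (std::fgets(buf, sizeof buf, f)) {
        std::size_t len = std::strlen(buf);
        // Strip the trailing newline (and a carriage return, if present).
        while (len > 0 && (buf[len - 1] == '\n' || buf[len - 1] == '\r')) --len;
        data.push_back(split_tabs(buf, len));  // the temporary Row is moved, not copied
    }
    std::fclose(f);
    return true;
}
```

Whether this beats the iostream version depends on the standard library and the platform, so it is worth timing both on a real file.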

You could try std::deque instead of std::vector (though I do not know whether that helps).

As muos suggested, you can use move semantics; you can also use emplace_back().

So I suggest trying

Element.push_back(std::move(txtElement));

Data.push_back(std::move(Element));

or

Element.emplace_back(std::move(txtElement));

Data.emplace_back(std::move(Element));

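You can also swap the following lines (if I am not mistaken, before C++20 std::istringstream has no constructor that moves from a string, so it copies line either way)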

Line.push_back(line);
istringstream txtStream(line);

adding move semantics (and emplace_back())

istringstream txtStream(line);
Line.emplace_back(std::move(line));

P.S.: obviously, reserve() is useful too.

Answer 3

In the while loop, you could try changing the lines

while (getline(infile, line))
{
    Line.push_back(line);
    istringstream txtStream(line);
    string txtElement;
    vector<string> Element;
    while (getline(txtStream, txtElement, '\t')){
        Element.push_back(txtElement);
    }
    Data.push_back(Element);
}

to:

while (getline(infile, line))
{
    Line.push_back(line);
    istringstream txtStream(line);
    string txtElement;
    //vector<string> Element; [-]
    Data.emplace_back(); // [+]
    while (getline(txtStream, txtElement, '\t')) {
        //Element.push_back(txtElement); [-]
        Data.back().push_back(txtElement); // [+]
    }
    //Data.push_back(Element); [-]
}

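This way, the vectors in Data do not need to be moved or copied there; they are already constructed in place, albeit empty. The vector in Data is default-constructed by .emplace_back(). We get the last element of Data with the .back() function and push our values with .push_back() as usual. Hope this helps :)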

Answer 4

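You can also call reserve() on the vector so that it is created closer to its target size.

This also keeps the vector from jumping around the heap a lot, since it only reallocates once it grows past the reserved size.

If the vector grows past the size you reserved earlier, you can call reserve again: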

vector<int> vec;
vec.reserve(10);
for (int i=0;i < 1000; i++)
{
    if ( vec.size() == vec.capacity() )
    {
        vec.reserve(vec.size()+10);

    }
    vec.push_back(i);
}
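One caveat about the loop above: reserving a fixed 10 extra slots each time the vector fills up can trigger more reallocations than the vector's own growth policy would, although the exact numbers depend on the standard library. The sketch below counts capacity changes for both approaches; the function names are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Pushes n ints after a single up-front reserve(n) and counts how often
// the capacity changed along the way.
std::size_t capacity_changes_single_reserve(std::size_t n) {
    std::vector<int> vec;
    vec.reserve(n);
    std::size_t cap = vec.capacity();
    std::size_t changes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        vec.push_back(static_cast<int>(i));
        if (vec.capacity() != cap) {
            cap = vec.capacity();
            ++changes;
        }
    }
    return changes;
}

// Same count, but reserving `step` more slots whenever the vector is full,
// as in the loop above.
std::size_t capacity_changes_stepwise_reserve(std::size_t n, std::size_t step) {
    std::vector<int> vec;
    vec.reserve(step);
    std::size_t cap = vec.capacity();
    std::size_t changes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (vec.size() == vec.capacity()) {
            vec.reserve(vec.size() + step);
        }
        vec.push_back(static_cast<int>(i));
        if (vec.capacity() != cap) {
            cap = vec.capacity();
            ++changes;
        }
    }
    return changes;
}
```

The single up-front reserve never reallocates while the vector is being filled, so when a reasonable estimate of the final size is available, one reserve call before the loop is usually the simpler choice.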

How to speed up loading a text file into a multi-dimensional vector
