Question

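I read a 200 MB file into my application and, for some very strange reason, the application's memory usage goes above 600 MB. I have tried vector and deque, as well as std::string and char *, to no avail. I need the memory usage of my application to be almost the same as the size of the file I am reading, so any suggestions would be very helpful. Is there a bug causing this much memory consumption? Can you point out the problem, or should I rewrite the whole thing?

Windows Vista SP1 x64, Microsoft Visual Studio 2008 SP1, 32-bit release build, Intel CPU

The whole application so far: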

#include <string>
#include <vector>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <time.h>
#include <cstring>   // memcpy



static unsigned int getFileSize (const char *filename)
{
    std::ifstream fs;
    fs.open (filename, std::ios::binary);
    fs.seekg(0, std::ios::beg);
    const std::ios::pos_type start_pos = fs.tellg();
    fs.seekg(0, std::ios::end);
    const std::ios::pos_type end_pos = fs.tellg();
    const unsigned int ret_filesize (static_cast<unsigned int>(end_pos - start_pos));
    fs.close();
    return ret_filesize;
}
void str2Vec (std::string &str, std::vector<std::string> &vec)
{
    int newlineLastIndex(0);
    for (int loopVar01 = str.size(); loopVar01 > 0; loopVar01--)
    {
        if (str[loopVar01]=='\n')
        {
            newlineLastIndex = loopVar01;
            break;
        }
    }
    int remainder(str.size()-newlineLastIndex);

    std::vector<int> indexVec;
    indexVec.push_back(0);
    for (unsigned int lpVar02 = 0; lpVar02 < (str.size()-remainder); lpVar02++)
    {
        if (str[lpVar02] == '\n')
        {
            indexVec.push_back(lpVar02);
        }
    }
    int memSize(0);
    for (int lpVar03 = 0; lpVar03 < (indexVec.size()-1); lpVar03++)
    {
        memSize = indexVec[(lpVar03+1)] - indexVec[lpVar03];
        std::string tempStr (memSize,'0');
        memcpy(&tempStr[0],&str[indexVec[lpVar03]],memSize);
        vec.push_back(tempStr);
    }
}
void readFile(const std::string &fileName, std::vector<std::string> &vec)
{
    static unsigned int fileSize = getFileSize(fileName.c_str());
    static std::ifstream fileStream;
    fileStream.open (fileName.c_str(),std::ios::binary);
    fileStream.clear();
    fileStream.seekg (0, std::ios::beg);
    const int chunks(1000); 
    int singleChunk(fileSize/chunks);
    int remainder = fileSize - (singleChunk * chunks);
    std::string fileStr (singleChunk, '0');
    int fileIndex(0);
    for (int lpVar01 = 0; lpVar01 < chunks; lpVar01++)
    {
        fileStream.read(&fileStr[0], singleChunk);
        str2Vec(fileStr, vec);
    }
    std::string remainderStr(remainder, '0');
    fileStream.read(&remainderStr[0], remainder);
    str2Vec(fileStr, vec);      
}
int main (int argc, char *argv[])
{   
        std::vector<std::string> vec;
        std::string inFile(argv[1]);
        readFile(inFile, vec);
}

Answer 1

Your memory is being fragmented.

Try something like this:

  HANDLE heaps[1025];
  DWORD nheaps = GetProcessHeaps((sizeof(heaps) / sizeof(HANDLE)) - 1, heaps);

  for (DWORD i = 0; i < nheaps; ++i) 
  {
    ULONG  HeapFragValue = 2;   // 2 enables the low-fragmentation heap (LFH)
    HeapSetInformation(heaps[i],
                       HeapCompatibilityInformation,
                       &HeapFragValue,
                       sizeof(HeapFragValue));
  }

Answer 2

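If I'm reading this correctly, the biggest problem is that the algorithm automatically doubles the amount of memory required.

In readFile(), you read the whole file into a series of 'singleChunk'-sized strings (chunks), and then in the last loop of str2Vec() you allocate a temporary string for every newline-separated segment of the chunk. So you are doubling the memory right there.

You also have a speed problem: str2Vec makes two passes over the chunk to find all the newlines. There is no reason you can't do it in one.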
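
A minimal sketch of the single-pass idea, assuming C++11 or later. The helper name appendLines is hypothetical and not part of the question's code. Each line is constructed directly inside the vector, so there is no index vector, no zero-filled temporary and no memcpy:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: appends every complete line of `chunk` to `lines`
// in one pass. Returns the index just past the last newline so the caller
// can carry the unfinished tail over into the next chunk.
std::size_t appendLines(const std::string& chunk, std::vector<std::string>& lines)
{
    std::size_t lineStart = 0;
    for (std::size_t i = 0; i < chunk.size(); ++i)
    {
        if (chunk[i] == '\n')
        {
            // Builds the string in place from chunk[lineStart, i).
            lines.emplace_back(chunk, lineStart, i - lineStart);
            lineStart = i + 1;
        }
    }
    return lineStart;
}
```

The return value matters when a line straddles two chunks: the caller would keep chunk.substr(returned index) and prepend it to the next chunk before splitting again.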

Answer 3

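STL containers exist to abstract away memory management. If you have a hard memory limit, you can't really afford that abstraction.

I would suggest using mmap() to read the file (or, on Windows, MapViewOfFile()).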
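
As a rough sketch of the mmap() route: the code below is POSIX-only, and the function name countNewlinesMapped is made up for illustration. On Windows the corresponding calls would be CreateFileMapping and MapViewOfFile, which are not shown here. The file is mapped read-only and scanned in place, so no copy of the data is made on the heap:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (POSIX only): maps `path` read-only and counts the
// '\n' bytes in it. Returns -1 if the file cannot be opened or mapped.
long countNewlinesMapped(const char* path)
{
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects length 0

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return -1;

    const char* data = static_cast<const char*>(addr);
    long count = 0;
    for (std::size_t i = 0; i < size; ++i)
    {
        if (data[i] == '\n') ++count;
    }
    munmap(addr, size);
    return count;
}
```

In a 32-bit process a 200 MB mapping still needs that much contiguous address space, so the mapping call can fail even when enough physical memory is free.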

Answer 4

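Another thing you can do is load the entire file into a single block of memory. Then build a vector of pointers to the first character of each line, replacing each newline with \0 as you go so that every line is null-terminated. (This assumes, of course, that your strings are not supposed to contain \0.)

It isn't necessarily as convenient as having a vector of strings, but a vector of const char * may be "just as good".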
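
A minimal sketch of that approach, assuming the file contents have already been read into a std::vector<char>. The name splitInPlace is hypothetical:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: `buffer` holds the whole file. Every '\n' is
// overwritten with '\0' and a pointer to the start of each line is stored.
// The pointers stay valid only while `buffer` is neither resized nor destroyed.
std::vector<const char*> splitInPlace(std::vector<char>& buffer)
{
    std::vector<const char*> lines;
    if (buffer.empty()) return lines;

    // Terminate a final line that has no trailing newline. This must happen
    // before any pointer is taken, because push_back may reallocate.
    if (buffer.back() != '\n') buffer.push_back('\n');

    const char* lineStart = &buffer[0];
    for (std::size_t i = 0; i < buffer.size(); ++i)
    {
        if (buffer[i] == '\n')
        {
            buffer[i] = '\0';
            lines.push_back(lineStart);
            lineStart = &buffer[0] + i + 1;
        }
    }
    return lines;
}
```

The cost per line is one pointer instead of a whole std::string object plus a separate heap block. Note that the push_back on a full buffer can briefly double its memory, so for a large file it would be better to allocate one extra byte up front.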

Answer 5

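In readFile you have at least two copies of the file: the one held by the ifstream, and the data copied into the std::vector. As long as you open the file and then copy it the way you do, it will be hard to get the total memory footprint below twice the file size.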

Answer 6

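Don't use std::list. It needs more memory than a vector.
A vector does what is called "doubling": when it runs out of space, it allocates a multiple of (commonly twice) the memory it currently holds. To avoid this you can use the std::vector::reserve() method and, if I'm not mistaken, check the result with the std::vector::capacity() method (note that capacity() >= size()).

Since the number of lines is not known while the file is being read, I don't see a simple algorithm that avoids the "doubling" problem. Following the comment from slavy13.myopenid.com, the solution is to move the data into another, pre-reserved vector once reading is finished (a related question is "How to downsize std::vector?").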
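
One way to sketch the "move into a pre-reserved vector" step, assuming C++11 or later so that the strings can be moved instead of copied. The name trimCapacity is hypothetical:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: replaces `lines` with a vector whose capacity was
// reserved for exactly the current element count. The strings are moved,
// so the character data itself is not duplicated.
void trimCapacity(std::vector<std::string>& lines)
{
    std::vector<std::string> exact;
    exact.reserve(lines.size());
    for (std::string& line : lines)
    {
        exact.push_back(std::move(line));
    }
    lines.swap(exact);  // the old, oversized storage is freed when `exact` dies
}
```

C++11 also offers std::vector::shrink_to_fit(), but that is a non-binding request. With a pre-C++11 compiler such as the one in the question, the usual idiom is std::vector<std::string>(lines).swap(lines), which copies every string and so temporarily needs more memory.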

Answer 7

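First, how are you determining memory usage? Task Manager is not a suitable tool for this, because what it shows is not actually memory usage.

Second, apart from your (for some reason?) static variables, the only data that is not released once you have finished reading the file is the vector. So test its capacity, and test the capacity of every string it contains. Find out how much memory each of them uses. You have the tools to determine where the memory is going.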
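
A rough sketch of such a measurement follows. The name approxBytesHeld is hypothetical, and the figure is only an estimate: it ignores the allocator's per-block overhead, and on implementations with a small-string buffer it counts short strings twice.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: adds up the storage reserved by the vector for its
// string objects plus the capacity reserved by each string.
std::size_t approxBytesHeld(const std::vector<std::string>& vec)
{
    std::size_t total = vec.capacity() * sizeof(std::string);
    for (std::size_t i = 0; i < vec.size(); ++i)
    {
        total += vec[i].capacity();
    }
    return total;
}
```

Comparing this number with the file size shows whether the excess sits in the containers themselves or somewhere else, for example in heap fragmentation.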

Answer 8

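I think you are wrong to try to write your own buffering strategy.

The streams already implement a very good buffering strategy. If you think you need a bigger buffer, you can install one in the stream without writing any extra code to manage it.

This is what I came up with. NB: tested with a text version of the "King James Bible" that I found online.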

#include <string>
#include <vector>
#include <list>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <iostream>

class Line: public std::string
{
};

std::istream& operator>>(std::istream& in,Line& line)
{
    // Relatively efficient way to copy a line into a string.
    return std::getline(in,line);
}
std::ostream& operator<<(std::ostream& out,Line const& line)
{
    return out << static_cast<std::string const&>(line) << "\n";
}

void readLinesFromStream(std::istream& stream,std::vector<Line>& lines)
{
    /*
     * Read into a list as this is flexible in memory usage and will not
     * allocate huge chunks of un-required space.
     *
     * Even with huge files the space for list will be insignificant
     * compared to the size of the data.
     *
     * This then allows us to reserve the correct size of the vector
     * Thus avoiding huge memory chunks being prematurely allocated that
     * are not required. It also prevents the internal structure from
     * being copied every time the container is re-sized.
     */
    std::list<Line>     data;
    std::copy(  std::istream_iterator<Line>(stream),
                std::istream_iterator<Line>(),
                std::inserter(data,data.end())
             );

    /*
     * Reserve the correct size in the vector.
     * then copy out of the list into the vector
     */
    lines.reserve(data.size());
    std::copy(  data.begin(),
                data.end(),
                std::back_inserter(lines)
             );
}

void readLinesFromFile(std::string const& name,std::vector<Line>& lines)
{
    /*
     * Set up the file stream and override the default buffer used by the stream.
     * Make it big because we think the istream buffer is insufficient!!!!
     */
    std::ifstream       file;
    std::vector<char>   buffer(10000);
    file.rdbuf()->pubsetbuf(&buffer[0],buffer.size());

    file.open(name.c_str());
    readLinesFromStream(file,lines);
}


int main(int argc,char* argv[])
{
    std::vector<Line>   lines;
    readLinesFromFile(argv[1],lines);

    // Un-comment if your file is larger than 1100 lines.

    // I tested with a copy of the King James bible. 
    // std::cout << "Lines: " << lines.size() << "\n";
    // std::copy(lines.begin() + 1000,lines.begin() + 1100,std::ostream_iterator<Line>(std::cout));
}

Answer 9

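Try using a list instead of a vector. A vector is almost always linear in memory.

Of course, the fact that the elements are strings, which are (almost always) copy-on-write and reference-counted, should reduce the problem, but it might still help.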

Answer 10

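I don't know whether this is relevant, since I don't really know what your file looks like.

But you should be aware that std::string can have a fairly large space overhead when it stores a very short string. If you allocate a separate char * for each short string instead, you will also see the per-allocation block overhead.

How many strings are you putting into the vector, and what is their average length?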
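
To put numbers on that question, something like the following sketch could be run on the loaded data. The names LineStats and describeLines are hypothetical. It reports the line count, the average length, and the fixed cost of the std::string objects alone, before any character data is counted:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of a vector of lines.
struct LineStats
{
    std::size_t count;          // number of strings
    double      averageLength;  // mean size() of the strings
    std::size_t objectBytes;    // count * sizeof(std::string)
};

LineStats describeLines(const std::vector<std::string>& lines)
{
    LineStats stats = { lines.size(), 0.0, lines.size() * sizeof(std::string) };
    std::size_t totalChars = 0;
    for (std::size_t i = 0; i < lines.size(); ++i)
    {
        totalChars += lines[i].size();
    }
    if (!lines.empty())
    {
        stats.averageLength = static_cast<double>(totalChars) / lines.size();
    }
    return stats;
}
```

If averageLength turns out to be close to or below sizeof(std::string), the bookkeeping alone can account for as much memory as the text itself.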

Answer 11

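Perhaps you should elaborate on why you need to read the whole file into memory. I suspect there is a way to do what you want without reading the entire file into memory at once. If you really do need this, look at memory-mapped files, which will probably be more efficient than writing the equivalent yourself. Your internal data structures can then use offsets into the file. By the way, be sure to check whether you need to handle character encodings.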

Answer 12

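You should be aware that because you declared fileStream as static, it never goes out of scope, which means the file is not closed until the very end of execution. That will certainly tie up some memory. You could close it explicitly before the last str2Vec call to try to help the situation.

Also, you open and close the same file more than once. Just open it once and pass the stream by reference (resetting its state if necessary). Though I imagine you could do everything you need in a single pass.

Heck, I doubt you really need to know the file size the way you do here. You can simply keep reading "chunk"-sized amounts until you get a short read (at which point you are done).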
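
A sketch of the "read until a short read" loop, with hypothetical names readByChunks and readFileByChunks. It only totals the bytes, and the comment marks where each chunk would be processed:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <vector>

// Hypothetical helper: reads `in` in pieces of `chunkSize` bytes without
// knowing the total size in advance. Returns the number of bytes read.
std::size_t readByChunks(std::istream& in, std::size_t chunkSize)
{
    assert(chunkSize > 0);
    std::vector<char> chunk(chunkSize);
    std::size_t total = 0;
    for (;;)
    {
        in.read(&chunk[0], static_cast<std::streamsize>(chunk.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        total += got;                    // process chunk[0 .. got) here
        if (got < chunk.size()) break;   // short read: end of file or error
    }
    return total;
}

// Opens the file once, as a local (non-static) stream that is closed on return.
std::size_t readFileByChunks(const char* path, std::size_t chunkSize)
{
    std::ifstream file(path, std::ios::binary);
    return readByChunks(file, chunkSize);
}
```

Because the stream is a local object, the file is closed as soon as readFileByChunks returns, which also addresses the static fileStream concern above.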

Why not explain the goal of the code? I have a feeling there is a much simpler solution.

Answer 13

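I have found that the best way to handle lines is to memory-map the file read-only. Don't bother overwriting \n with \0. Instead use pairs of const char*, such as std::pair<const char*, const char*>, or a const char* paired with a count. If you need to edit lines, a good approach is an object that can store either the pointer pair or a std::string holding the modified line.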
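
The pointer-pair index could look roughly like this. LineRange and indexLines are hypothetical names, and [begin, end) stands for whatever read-only memory holds the file, for example a mapped view:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// A line is the half-open range [first, second); the newline is excluded.
typedef std::pair<const char*, const char*> LineRange;

// Hypothetical helper: indexes the lines in [begin, end) without modifying
// or copying the underlying bytes.
std::vector<LineRange> indexLines(const char* begin, const char* end)
{
    std::vector<LineRange> lines;
    const char* lineStart = begin;
    for (const char* p = begin; p != end; ++p)
    {
        if (*p == '\n')
        {
            lines.push_back(LineRange(lineStart, p));
            lineStart = p + 1;
        }
    }
    if (lineStart != end)
    {
        lines.push_back(LineRange(lineStart, end));  // last line, no newline
    }
    return lines;
}
```

The lines are not null-terminated, so code that consumes them has to use the range, for example std::string(range.first, range.second) when a copy is really needed.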

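As for saving memory with STL vectors or deques, a good trick is to let the container keep doubling until you have finished adding to it. Then resize it to its actual size, which should release the unused memory back to the heap allocator. The memory may still be allocated to the program, but I wouldn't worry about that. Also, instead of starting from the default size, take the file size in bytes, divide it by your best guess of the average number of characters per line, and reserve that much space up front.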
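
The up-front reservation is a one-liner. The sketch below uses a hypothetical helper name, reserveForFile, and the average line length is purely the caller's guess:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: reserves room for the estimated number of lines so
// that the vector does not have to grow repeatedly while the file is read.
void reserveForFile(std::vector<std::string>& lines,
                    std::size_t fileBytes,
                    std::size_t guessedBytesPerLine)
{
    assert(guessedBytesPerLine > 0);
    lines.reserve(fileBytes / guessedBytesPerLine + 1);
}
```

A guess that is too low only means the vector grows once or twice more. A guess that is too high wastes sizeof(std::string) bytes per unused slot until the vector is trimmed.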

Answer 14

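A vector grown with push_back() will cause memory fragmentation and inefficient memory use. I would try using a list, and only create a vector (if you need one) once you know exactly how many elements it has to hold.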

I think the STL is causing my application to use three times the memory
