Question

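I have a program that needs to analyze 100,000 files spread across multiple file systems.

After processing about 3,000 files it starts to slow down. I ran it through gprof, but since the slowdown doesn't start until 30-60 seconds into the analysis, I don't think it tells me much.

How would I track down the cause? top doesn't show high CPU, and the process memory does not increase over time, so is it I/O?

At the top level, we have: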

scanner.init(); // build a std::vector<std::string> of pathnames.
scanner.scan(); // analyze those files

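Now, init() completes in under 1 second. It populates the vector with 70,000 actual filenames and 30,000 symbolic links.

scan() iterates over the entries in the vector, looks at the filename, reads the contents (say 1KB of text), and builds a "segment list" [1]

I've read conflicting views on the evils of using std::string, especially passing them as arguments. All the functions pass std::strings, structures, etc. by reference (&).

But it does use a lot of string processing to parse filenames, extract substrings, and search for substrings. (If they were evil, the program should always be slow, not slow down after a while.)

Could that be the reason for slowing down over time?

The algorithm is very straightforward and doesn't use any new / delete operators...

Abbreviated, scan():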

while (tsFile != mFileMap.end())
{
    curFileInfo.filePath = tsFile->second;

    mpUtils->parseDateTimeString(tsFile->first, curFileInfo.start);

    // Ignore files too small
    size_t fs = mpFileActions->fileSize(curFileInfo.filePath);
    mDvStorInfo.tsSizeBytes += fs;

    if (fileNum++ % 200 == 0)
    {
        usleep(LONGNAPUSEC); // long nap to give others a turn
    }

    // collect file information
    curFileInfo.locked    = isLocked(curFileInfo.filePath);
    curFileInfo.sizeBytes = mpFileActions->fileSize(curFileInfo.filePath);
    getTsRateAndPktSize(curFileInfo.filePath, curFileInfo.rateBps, curFileInfo.pktSize);
    getServiceIdList(curFileInfo.filePath, curFileInfo.svcIdList);

    std::string fileBasePath;
    fileBasePath = mpUtils->strReplace(".ts",     "", curFileInfo.filePath.c_str());
    fileBasePath = mpUtils->strReplace(".lockts", "", fileBasePath.c_str()); // chained replace

    // Extract the last part of the filename, ie. /mnt/das.b/20160327.104200.to.20160327.104400
    getFileEndTimeAndDuration(fileBasePath, curFileInfo);

    // Update machine info for both actual ts duration and span including gaps
    mDvStorInfo.tsDurationSec     += curFileInfo.durSec;

    if (!firstTime)
    {
        // beef is here.
        if (hasGap(curFileInfo, prevFileInfo)           ||
            lockChanged(curFileInfo, prevFileInfo)      ||
            svcIdListChanged(curFileInfo, prevFileInfo) ||
            lastTsFile(tsFile))
        {
            // This current file differs from those before it so
            // close off previous segment and push to list

            curSegInfo.prevFileStart = curFileInfo.start;

            mSegmentList.push_back(curSegInfo);

            prevFileInfo = curFileInfo;  // do this before resetting everything!

            // initialize the new segment
            resetSegmentInfo(curSegInfo);
            copyValues(curSegInfo, curFileInfo);
            resetFileInfo(curFileInfo);
        }
        else
        {
            // still running. Update current segment info
            curSegInfo.durSec       += curFileInfo.durSec;
            curSegInfo.sizeBytes    += curFileInfo.sizeBytes;
            curSegInfo.end           = curFileInfo.end;
            curSegInfo.prevFileStart = prevFileInfo.start;

            prevFileInfo = curFileInfo;
        }
    }
    else // first time
    {
        firstTime = false;
        prevFileInfo = curFileInfo;
        copyValues(curSegInfo, curFileInfo);
        resetFileInfo(curFileInfo);
    }

    ++tsFile;
}

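where:

curFileInfo/prevFileInfo is a simple struct. The other functions do string processing and return references to std::strings.

fileSize is calculated by calling stat().

getServiceIdList opens the file with fopen, reads each line, and closes the file.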

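UPDATE

Removing the push_back to the container did not change the performance. However, a rewrite using C functions (e.g. strstr(), strcpy(), etc.) now shows constant performance.

The culprit was the std::strings - despite passing them as & refs, I guess there was too much construction/destruction/copying.

[1] Filenames are named by YYYYMMDD.HHMMSS date/time, e.g. 20160612.093200. The purpose of the program is to find gaps in time among the names of the 70,000 files and build a list of contiguous time segments.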

Answer 1

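This could be a heap fragmentation problem. Over time, the heap can turn into Swiss cheese, which makes it much harder for the memory manager to allocate blocks, and can even force swapping when there is free RAM, because no contiguous free block is large enough. Here's an MSDN article about heap fragmentation.

You mention using std::vector, which guarantees contiguous memory, so it may be a major culprit for heap fragmentation: every time the collection grows past its capacity, it has to free and reallocate. If you don't need the contiguity guarantee, you could try a different container.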
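If you do want to keep std::vector, one alternative to switching containers is to reserve the capacity once, before filling it. This is a different technique from the one suggested above, and it only helps when the number of entries is known (or can be estimated) in advance. A minimal sketch, where buildPathList and the path names are made up for illustration:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): builds a vector of
// expectedCount placeholder path names after reserving the capacity up
// front, so push_back never has to reallocate inside the loop.
std::vector<std::string> buildPathList(std::size_t expectedCount)
{
    std::vector<std::string> paths;
    paths.reserve(expectedCount); // one allocation for the element array

    for (std::size_t i = 0; i < expectedCount; ++i)
    {
        // Made-up names; the real program would push actual path names here.
        paths.push_back("/mnt/example/file" + std::to_string(i));
    }
    return paths;
}
```

Note that reserve() only covers the vector's own array of elements. Each std::string may still allocate its own buffer for the characters.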

Answer 2

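"Filenames are named by YYYYMMDD.HHMMSS date/time, e.g. 20160612.093200. The purpose of the program is to find gaps in time among the names of the 70,000 files and build a list of contiguous time segments."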

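Comparing strings is slow; O(N). Comparing integers is fast; O(1). Instead of storing the filenames as strings, consider storing them as integers (or pairs of integers).

I strongly suggest you use hash maps, if possible. See std::unordered_set and std::unordered_map. These will greatly reduce the number of comparisons.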
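As a rough sketch of the integer idea, assuming every name has exactly the YYYYMMDD.HHMMSS layout quoted above: the 14 digits can be packed into one 64-bit integer. timestampKey is a hypothetical helper, not something from the question's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: converts a name such as "20160612.093200" into the
// integer 20160612093200. Returns false (leaving key untouched) if the text
// is not 8 digits, a dot, and 6 digits.
bool timestampKey(const std::string& name, std::uint64_t& key)
{
    if (name.size() != 15 || name[8] != '.')
        return false;

    std::uint64_t value = 0;
    for (std::size_t i = 0; i < name.size(); ++i)
    {
        if (i == 8)
            continue; // skip the '.' separator

        const char c = name[i];
        if (c < '0' || c > '9')
            return false;

        value = value * 10 + static_cast<std::uint64_t>(c - '0');
    }

    key = value;
    return true;
}
```

Keys built this way sort in the same order as the timestamps, so ordering and equality checks become single integer comparisons. Subtracting two keys does not give the elapsed seconds, though, so gap detection would still need a conversion to real time values.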

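"Removing the push_back to the container did not change the performance. However, a rewrite using C functions (e.g. strstr(), strcpy(), etc.) now shows constant performance."

std::set<char*> sorts by pointer address, not by the strings the pointers refer to.

Don't forget to std::move your strings to reduce allocations.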
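If a set of C strings is really what you want, it needs a comparator that looks at the characters. A minimal sketch, where CStrLess and CStrSet are illustrative names:

```cpp
#include <cstring>
#include <set>

// Comparator that orders C strings by content rather than by address.
struct CStrLess
{
    bool operator()(const char* a, const char* b) const
    {
        return std::strcmp(a, b) < 0;
    }
};

// A set that treats two pointers to equal text as the same key.
// It stores only the pointers, so the character data must outlive the set.
using CStrSet = std::set<const char*, CStrLess>;
```

With the default comparator, two separate buffers holding the same text count as two different keys. With CStrLess they collapse into one.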
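For example, when a record is pushed into a container and then reset anyway, it can be moved instead of copied. A minimal sketch: FileInfo here is a hypothetical stand-in loosely modelled on the question's curFileInfo, and appendMoved is an illustrative name.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type, loosely modelled on the question's curFileInfo.
struct FileInfo
{
    std::string filePath;
};

// Appends info to the list by moving it, so the string's heap buffer can be
// handed over instead of copied. Afterwards info is in a valid but
// unspecified state and should be reassigned before it is read again.
void appendMoved(std::vector<FileInfo>& list, FileInfo& info)
{
    list.push_back(std::move(info));
}
```

This only pays off when the source object is no longer needed in its old state. Where the code still reads the object after the push, a copy is unavoidable.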

C++ slows down over time reading 70,000 files
