Question

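I'm trying to write a program that uses an Excel document as a configuration file to read user-supplied wildcard file names and wildcard search strings. For example, a user could enter C:\Read*.txt, and every file on the C drive whose name starts with "Read", continues with any characters, and ends in .txt would be included in the search.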

They could search for Message:*, and every string that starts with "Message:" and ends with any sequence of characters would be matched.
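
For readers following along, here is a minimal sketch of how such wildcards might be turned into regex patterns. The helper names wildcardToRegex and fileNameMatches are made up for illustration and are not from my program; the sketch assumes that '*' is the only wildcard and that every other character should match literally.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: translate a user wildcard into an ECMAScript regex
// string. Assumption: '*' (any run of characters) is the only wildcard;
// every other character is matched literally.
std::string wildcardToRegex(const std::string& wildcard) {
    // Characters that have a special meaning in ECMAScript regexes.
    static const std::string special = R"re(\^$.|?+()[]{})re";
    std::string out;
    out.reserve(wildcard.size() * 2);
    for (char c : wildcard) {
        if (c == '*') {
            out += ".*";  // wildcard: any sequence of characters
        } else {
            if (special.find(c) != std::string::npos) {
                out += '\\';  // escape so the character matches itself
            }
            out += c;
        }
    }
    return out;
}

// Hypothetical helper: a file name has to match the whole wildcard,
// so regex_match is used rather than regex_search.
bool fileNameMatches(const std::string& name, const std::string& wildcard) {
    const std::regex re(wildcardToRegex(wildcard));
    return std::regex_match(name, re);
}
```

With this sketch, fileNameMatches("ReadMe.txt", "Read*.txt") is true, and the search string Message:* becomes the regex Message:.*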

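So far the program works, but it is very slow, and I need it to be able to search very large files. I'm using file streams and the regex class to do this, but I'm not sure whether it should be taking this long.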

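Most of the time in my code is spent in the following loop (I've included the lines just above the while loop so you can better understand what I'm doing):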

// Fragment from inside a function. Needs <chrono>, <fstream>, <iostream>, <regex>,
// <string> and <vector>, with `using namespace std;` and `using namespace std::chrono;`.
// regex_patterns, filePath, line, lineNumb and lineTextToSave are declared elsewhere.
smatch matches;
vector<regex> expressions;

// Compile every pattern once, before the file is read.
for (size_t i = 0; i < regex_patterns.size(); i++)
{
    expressions.emplace_back(regex_patterns.at(i));
}

auto startTimer = high_resolution_clock::now();
// Open file and begin reading
ifstream stream1(filePath);
if (stream1.is_open())
{
    int count = 0; // zero-based index of the current line
    while (getline(stream1, line))
    {
        // Skip empty lines, there is no point in searching them, but still
        // count them so the saved line numbers stay in step with the file.
        if (line.empty())
        {
            count = count + 1;
            continue;
        }

        // Loop through each search string; on a match, save the line number and line text.
        for (size_t i = 0; i < expressions.size(); i++)
        {
            // regex_search returns bool, not size_t.
            bool found = regex_search(line, matches, expressions.at(i));
            if (found)
            {
                lineNumb.push_back(count);
                lineTextToSave.push_back(line);
            }
        }
        count = count + 1;
    }
}
auto stopTimer = high_resolution_clock::now();
auto duration2 = duration_cast<milliseconds>(stopTimer - startTimer);
cout << "Time to search file: " << duration2.count() << "\n";

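Is there a better way to search files than this? I've looked up many things, but so far I haven't found a programmatic example that I understand.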

Answer 1

Some ideas, in order of priority:

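Join all of the regex patterns into a single regex instead of matching r separate regexes against each line. This can speed the program up by as much as a factor of r. Example: (R1)|(R2)|(...)|(Rr)
Make sure the regexes are compiled before they are used, once, outside the read loop.
Do not add a trailing .* to your regex patterns. regex_search already finds a match anywhere in the line, so the trailing .* only adds work.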
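
A minimal sketch of the first two ideas, under some assumptions: joinPatterns, searchLines and Hit are names invented for this example, and each pattern is wrapped in a non-capturing group (?:...) rather than the capturing (R1)|(R2) form shown above, since the sketch never needs to know which alternative matched. std::regex::optimize is only a hint to the library, so any gain from it should be measured rather than assumed.

```cpp
#include <cstddef>
#include <istream>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: wrap each pattern in a non-capturing group and
// join the groups with '|' to form one alternation.
std::string joinPatterns(const std::vector<std::string>& patterns) {
    std::string combined;
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        if (i != 0) combined += '|';
        combined += "(?:" + patterns[i] + ")";
    }
    return combined;
}

// Hypothetical result type: zero-based line number plus the line text.
struct Hit {
    std::size_t lineNumber;
    std::string text;
};

// Scan the stream line by line with ONE regex that is compiled ONCE.
std::vector<Hit> searchLines(std::istream& in,
                             const std::vector<std::string>& patterns) {
    std::vector<Hit> hits;
    if (patterns.empty()) return hits;  // an empty regex would match every line

    // Compiled once, outside the loop.
    const std::regex combined(joinPatterns(patterns), std::regex::optimize);

    std::string line;
    std::size_t lineNumber = 0;
    while (std::getline(in, line)) {
        // No match_results argument: only a yes/no answer is needed here.
        if (!line.empty() && std::regex_search(line, combined)) {
            hits.push_back({lineNumber, line});
        }
        ++lineNumber;  // empty lines are counted too
    }
    return hits;
}
```

One difference from the loop in the question: a line that matches several patterns is recorded once here, not once per pattern.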

Some ideas that are not portable:

Memory-map the file instead of reading it through iostream.
Consider whether it is worth reimplementing grep rather than calling grep through popen().
