Question

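I am running a C++ program in Visual Studio. I have a regular expression, and I am parsing a file of more than 2 million lines for strings that match it. Here is the code: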

#include <fstream>
#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
    ifstream myfile("file.log");
    if (myfile.is_open())
    {
        int order_count = 0;
        regex pat(R"(.*(SOME)(\s)*(TEXT).*)");
        for (string line; getline(myfile, line);)
        {
            smatch matches;
            if (regex_search(line, matches, pat)) {
                order_count++;
            }
        }
        myfile.close();
        cout << order_count;
    }

    return 0;
}

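The program should search the file for matching strings and count their occurrences. I have a Python version of the program that finishes within 4 seconds using the same regex. I have waited about 5 minutes for the C++ code above and it still has not finished. It is not stuck in an infinite loop: I had it print the current line number at intervals, and it is progressing. Is there a different way I should write the code above?

EDIT: This is being run in release mode.

EDIT: Here is the Python code: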

import re

class PythonLogParser:

    def __init__(self, filename):
        self.filename = filename

    def open_file(self):
        f = open(self.filename)
        return f

    def count_stuff(self):
        f = self.open_file()
        order_pattern = re.compile(r'(.*(SOME)(\s)*(TEXT).*)')
        order_count = 0
        for line in f:
            # re.match anchors at the start of the line only
            if order_pattern.match(line) is not None:
                order_count += 1
        print('Number of Orders (\'ORDER\'): {0}\n'.format(order_count))
        f.close()

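The program finally finished running. What is most disconcerting is that the output is incorrect (I know what the value should be).

Perhaps using a regex for this problem is not the best solution. I will update if I find one that works better.

EDIT: Based on the answer by @ecatmur, I made the following changes, and the C++ program ran much faster.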

int main() {
    ifstream myfile("file.log");
    if (myfile.is_open())
    {
        int order_count = 0;
        regex pat(R"(.*(SOME)(\s)*(TEXT).*)");
        for (string line; getline(myfile, line);)
        {
            if (regex_match(line, pat)) {
                order_count++;
            }
        }
        myfile.close();
        cout << order_count;
    }

    return 0;
}

Answer 1

You should be using regex_match, not regex_search.

7.2.5.3. search() vs. match()

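Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.

And: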

std::regex_search

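regex_search will successfully match any subsequence of the given sequence, whereas std::regex_match will only return true if the regular expression matches the entire sequence.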
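To make the quoted difference concrete, here is a minimal sketch. The three helper function names are made up for this illustration, and the capture groups from the original pattern are dropped on the assumption that only a yes/no answer is needed.

```cpp
#include <regex>
#include <string>

// True if ANY substring of the line matches the core pattern.
bool found_anywhere(const std::string& line) {
    static const std::regex core(R"(SOME\s*TEXT)");
    return std::regex_search(line, core);
}

// True only if the ENTIRE line matches the core pattern.
bool whole_line_is_core(const std::string& line) {
    static const std::regex core(R"(SOME\s*TEXT)");
    return std::regex_match(line, core);
}

// True if the entire line matches the pattern padded with ".*" on both
// sides, which is what the question's pattern does.
bool whole_line_padded(const std::string& line) {
    static const std::regex padded(R"(.*SOME\s*TEXT.*)");
    return std::regex_match(line, padded);
}
```

For a line such as "prefix SOME   TEXT suffix", found_anywhere and whole_line_padded should both return true, while whole_line_is_core returns false because the line contains more than the core text.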

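Using regex_search you are generating n * m match results, where n is the number of characters before and m is the number of characters after the central part of your search string. Not surprisingly, this takes a long time to generate.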

In fact, it would be more efficient to use regex_search, but only with the central part of your search string:

    regex pat(R"((SOME)(\s)*(TEXT))");

and to use the overload of regex_search that does not take a match results out-parameter (since you are ignoring the match results):

        if (regex_search(line, pat)) {    // no "matches"
            order_count++;
        }
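Putting the two suggestions together, the counting loop might look like the sketch below. It reads from any std::istream so it can be tried on in-memory text; the names count_matching_lines and count_matching_lines_in are hypothetical, and the capture groups are omitted on the assumption that the captured text is never used.

```cpp
#include <istream>
#include <regex>
#include <sstream>
#include <string>

// Counts the lines of `in` that contain SOME, optional whitespace, then TEXT.
int count_matching_lines(std::istream& in) {
    // Compile the regex once, outside the loop; central part of the pattern only.
    const std::regex pat(R"(SOME\s*TEXT)");
    int count = 0;
    for (std::string line; std::getline(in, line);) {
        // Overload without a match_results argument: only a yes/no is needed.
        if (std::regex_search(line, pat)) {
            ++count;
        }
    }
    return count;
}

// Hypothetical convenience wrapper for trying the function on in-memory text.
int count_matching_lines_in(const std::string& text) {
    std::istringstream in(text);
    return count_matching_lines(in);
}
```

With a real log, you would pass an opened std::ifstream for "file.log" to count_matching_lines and print the result.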

C++ program takes minutes to parse a large file, whereas Python runs in a few seconds
