Question

I have the following input/output and regex code, which works fine (for the input/output below).

- Input -

keep this

      keep this too

     Bye
------ Remove Below ------
  remove all of this

- Output -

keep this

      keep this too

     Bye

- Code -

    String text = "keep this\n       \n"
            + "      keep this too\n      \n     Bye\n------ Remove Below ------\n  remove all of this\n";
    System.out.println(text);
    Pattern PATTERN = Pattern.compile("^(.*?)(-+)(.*?)Remove Below(.*?)(-+)(.*?)$",
             Pattern.DOTALL);
    Matcher m = PATTERN.matcher(text);
    if (m.find()) {
        // remove everything as expected (per the input -> regex -> output above)
        text =  ((m.group(1)).replaceAll("[\n]+$", "")).replaceAll("\\s+$", "");
        System.out.println(m.group(1));
        System.out.println(text);
    }

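OK, so this works well. However, that is a test with defined input and output. When I get large files that contain the sequence of characters below, the code takes a while (4-5 seconds) inside the find() method for a 100k file. In fact, sometimes I'm not sure it is going to return at all: when I run it as a debug test, the find() method hangs and my client disconnects.

Note: there is nothing to match in this file, but this is a pattern that taxes my regex.

- 100k file -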

junk here
more junk here
o o o (even more junk per the ellipses) 
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses) 
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses) 
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses) 


this repeats from above to make up the 100k file.

- ASK -

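How can I optimize the regex above to handle a large file with a pattern like the one shown, or is this kind of speed (4-6 seconds, or hanging completely) normal for regex parsing?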

Answer 1

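You're right, this is a backtracking nightmare!

Keep the number of possible matches down when you use wildcards. Some strategies that may help:

If the number of '-' characters is known, test against the concrete string: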

^(.*?)(------ Remove Below ------)(.*)$

or at least be a bit more specific:

^(.*?)-*-\s*Remove Below\s*--*(.*?)$

or, more precisely:

^(.*?)(-+)([^-]*)Remove Below([^-]*)(-+)(.*?)$

Be greedy if you can:

^(.*)(-+)(.*?)Remove Below(.*?)(-+)(.*?)$

Don't capture parts of the match that you don't need:

^(.*?)-+.*?Remove Below.*?-+.*?$

Of course, depending on the quality of your input, you can combine these concepts:

^(.*)------ Remove Below ------.*$

In your case, parse line by line, and stop processing as soon as a line matches ^.*-+\s*Remove Below\s*-+.*$
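The line-by-line idea could be sketched roughly as follows. This is only an illustration, not the answerer's code: the class name `MarkerLineTrimmer`, the method `keepAboveMarker`, and the assumption that the marker sits alone on its own line (so the pattern can be matched against the whole line) are all made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

final class MarkerLineTrimmer {
    // Assumption: the marker is alone on its line, e.g. "------ Remove Below ------".
    private static final Pattern MARKER_LINE =
            Pattern.compile("\\s*-+\\s*Remove Below\\s*-+\\s*");

    /** Returns the text above the first marker line, without trailing whitespace. */
    static String keepAboveMarker(String text) {
        StringBuilder kept = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The regex only ever sees one line, so backtracking stays bounded by line length.
                if (MARKER_LINE.matcher(line).matches()) {
                    break; // everything from the marker on is discarded
                }
                kept.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        // Drop trailing whitespace, like the replaceAll("\\s+$", "") in the question.
        int end = kept.length();
        while (end > 0 && Character.isWhitespace(kept.charAt(end - 1))) {
            end--;
        }
        return kept.substring(0, end);
    }

    public static void main(String[] args) {
        String text = "keep this\n       \n      keep this too\n      \n     Bye\n"
                + "------ Remove Below ------\n  remove all of this\n";
        System.out.println(keepAboveMarker(text));
    }
}
```

If no marker line is found, this sketch returns the whole input minus trailing whitespace. For a real file, the `StringReader` could be swapped for a file reader so the 100k input never has to be held in a single string.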

Answer 2

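Since you are only interested in the text above the ------ Remove Below ------ line, you don't need to match everything. Shorten your regex so it matches only what you want, which avoids over-matching and backtracking.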

Pattern PATTERN = Pattern.compile("^(.*?)-+ *Remove Below *-+", Pattern.DOTALL);
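For context, here is a minimal sketch of how that shortened pattern might slot into the question's code. The class name `ShortPatternTrimmer` and the method `keepAboveMarker` are hypothetical, and returning the input unchanged when there is no marker is an assumption about the desired behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ShortPatternTrimmer {
    // Pattern from the answer: the match ends right after the marker,
    // so nothing below the marker is ever scanned or captured.
    private static final Pattern PATTERN =
            Pattern.compile("^(.*?)-+ *Remove Below *-+", Pattern.DOTALL);

    static String keepAboveMarker(String text) {
        Matcher m = PATTERN.matcher(text);
        if (!m.find()) {
            return text; // assumption: no marker means nothing to remove
        }
        // Same trailing-whitespace cleanup as in the question.
        return m.group(1).replaceAll("\\s+$", "");
    }

    public static void main(String[] args) {
        String text = "keep this\n       \n      keep this too\n      \n     Bye\n"
                + "------ Remove Below ------\n  remove all of this\n";
        System.out.println(keepAboveMarker(text));
    }
}
```

With only one lazy wildcard left, a failed attempt at a run of dashes costs work proportional to that run, instead of being multiplied by every way the later `(.*?)` and `(-+)` groups could split the rest of the file.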

Answer 3

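If you are sure that the content you want to remove is at the end of the file, reverse your input string. This should help you a lot. Instead of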

Matcher m = PATTERN.matcher(text);

use

Matcher m = PATTERN.matcher(new StringBuilder(text).reverse());

Remember to reverse the pattern as well.
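A rough sketch of what the reversal might look like end to end. The answer does not spell out the reversed pattern, so the one below (only the marker, written backwards) is an assumption, as are the names `ReversedTrimmer` and `keepAboveMarker`. Note one behavioural difference: the first hit in the reversed text is the last marker in the original text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReversedTrimmer {
    // Assumption: only the marker is matched, written backwards
    // ("Remove Below" becomes "woleB evomeR").
    private static final Pattern REVERSED_MARKER =
            Pattern.compile("-+ *woleB evomeR *-+");

    static String keepAboveMarker(String text) {
        String reversed = new StringBuilder(text).reverse().toString();
        Matcher m = REVERSED_MARKER.matcher(reversed);
        if (!m.find()) {
            return text; // assumption: no marker means nothing to remove
        }
        // Everything after the reversed marker is the kept text, still backwards.
        String keptReversed = reversed.substring(m.end());
        String kept = new StringBuilder(keptReversed).reverse().toString();
        return kept.replaceAll("\\s+$", "");
    }

    public static void main(String[] args) {
        String text = "keep this\n       \n      keep this too\n      \n     Bye\n"
                + "------ Remove Below ------\n  remove all of this\n";
        System.out.println(keepAboveMarker(text));
    }
}
```

When the marker really is near the end of the file, the search stops after scanning only the short tail. When there is no marker at all, the whole reversed string is still scanned.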

Answer 4

You could use a third-party regex library. Here you have benchmarks

Possible backtracking regex performance issue?
