使用正则表达式处理大型文件时出现StackOverflowError

时间:2014-01-31 12:42:25

标签: java regex large-files stack-overflow

我有以下代码:

 String regex = "Some String(.*\r?\n)*.*\\* testing .*\r?\n.*\r?\n.*children.*\r?\n";
 Pattern p = Pattern.compile(regex);
 System.out.println(p.pattern());
 String content = readFile(); // it return file content
 Matcher m = p.matcher(content);

因此,当我使用小文件内容进行测试时,它可以正常工作。但是对于大文件内容,我的代码会出现以下错误:

java.lang.StackOverflowError
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4606)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    ...

我认为正则表达式引擎正在产生问题。这可能是因为嵌套量词。那么任何人都可以优化我的正则表达式,以便正则表达式引擎能够轻松处理它吗?

1 个答案:

答案 0 :(得分:1)

可能的问题是你的正则表达式匹配太多行。既然你有:

\r?\n 

在您的记录中多次,它可以匹配多行。

所以:

"Some String(.*\r?\n)*.*\\* testing .*\r?\n.*\r?\n.*children.*\r?\n"

会匹配

Some String


\ testing



children


Some String

您需要找到一个完成每条记录的表达式,否则正则表达式引擎将能够匹配从第一个记录到最后一个记录,并且可能会破坏。

这个理论显示在堆栈跟踪中,其中有几个堆栈跟踪块与此重复:

at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)