Sporadic stack overflow error in Java Matcher

Time: 2015-09-30 21:57:57

Tags: java regex

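I have some file parser code in which I occasionally get a stack overflow error on m.matches() (where m is a Matcher).

When I run my application again, it parses the same file with no stack overflow.

My pattern is admittedly a bit complex. It is basically a bunch of optional zero-length positive lookaheads with named groups inside them, so that I can match a bunch of variable name/value pairs regardless of their order. But I would expect that if some string were going to cause a stack overflow error, it would always cause it... not just sometimes... any ideas?

A simplified version of my pattern:     "prefix(?=\\s+user=(?<user>\\S+))?(?=\\s+repo=(?<repo>\\S+))?.*?"

The full regex is...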

app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?(?=(?:[^"]|"[^"]*")*\s+remote_address=(?<ip>\S+))?(?=(?:[^"]|"[^"]*")*\s+now="(?<time>\S+)\+\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+url="(?<url>\S+)")?(?=(?:[^"]|"[^"]*")*\s+referer="(?<referer>\S+)")?(?=(?:[^"]|"[^"]*")*\s+status=(?<status>\S+))?(?=(?:[^"]|"[^"]*")*\s+elapsed=(?<elapsed>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_method=(?<requestmethod>\S+))?(?=(?:[^"]|"[^"]*")*\s+created_at="(?<createdat>\S+)(?:-|\+)\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+pull_request_id=(?<pullrequestid>\d+))?(?=(?:[^"]|"[^"]*")*\s+at=(?<at>\S+))?(?=(?:[^"]|"[^"]*")*\s+fn=(?<fn>\S+))?(?=(?:[^"]|"[^"]*")*\s+method=(?<method>\S+))?(?=(?:[^"]|"[^"]*")*\s+current_user=(?<user2>\S+))?(?=(?:[^"]|"[^"]*")*\s+content_length=(?<contentlength>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_category=(?<requestcategory>\S+))?(?=(?:[^"]|"[^"]*")*\s+controller=(?<controller>\S+))?(?=(?:[^"]|"[^"]*")*\s+action=(?<action>\S+))?.*?

Top of the stack overflow error's stack trace... (it is about 9800 lines long)

Exception: java.lang.StackOverflowError
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)

An example of a line on which I got the error. (Although I have run it 10 times since and did not get any error.)

app=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=675fa67e-c1de-4bfa-a965-127b928d427a server_id=c31404fc-b7d0-41a1-8017-fc1a6dce8111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.041 url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" worker_request_count=77192 request_category=apiapp=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=89fcb32e-9ab5-47f7-9464-e5f5cff175e8 server_id=1b74880a-5124-4483-adce-111b60dac111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.024 
url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" worker_request_count=76263 request_category=api
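Interestingly... this line appears to be a glitch... the log seems to have put a newline in the wrong place, resulting in two log entries on one line followed by a blank line. It is this long line that caused the error... well, anyway... now it runs fine with no stack overflow.

2 Answers:

Answer 0: (score: 11)

There are two ways to solve your problem:

  • Parse the input string properly and get the key-values from a Map.

    I strongly recommend this approach, since the code will be much cleaner and we no longer need to watch the limit on the input size.

  • Modify the existing regex to greatly reduce the impact of the implementation flaw that causes StackOverflowError.

Parse the input string

You can parse the input string with the following regex: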

\G\s*+(\w++)=([^\s"]++|"[^"]*+")(?:\s++|$)
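  • All quantifiers are made possessive (*+ instead of *, ++ instead of +), since the pattern I wrote does not need backtracking.

  • The basic regex (\w++)=([^\s"]++|"[^"]*+") in the middle is what matches a key-value pair.

  • \G makes sure that each match starts where the previous match ended. It prevents the engine from "bumping along" to a later position when it fails to match.

  • \s*+ and (?:\s++|$) consume the excess whitespace. I specify (?:\s++|$) instead of \s*+ to prevent key="value"key=value from being recognized as valid input.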
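
To make the effect of \G and of the trailing (?:\s++|$) concrete, here is a small sketch that runs the same key-value regex with and without the \G anchor on malformed input. The class name AnchorDemo and the helper keys are made up for this illustration; only the regex itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // The key-value regex from the answer, without the leading \G.
    static final String PAIR = "\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)";
    static final Pattern ANCHORED = Pattern.compile("\\G" + PAIR);
    static final Pattern UNANCHORED = Pattern.compile(PAIR);

    // Hypothetical helper: collects the keys of every match that find() reports.
    static List<String> keys(Pattern pattern, String input) {
        List<String> keys = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            keys.add(m.group(1));
        }
        return keys;
    }

    public static void main(String[] args) {
        String garbage = "a=1 ??? b=2";
        // With \G the scan stops at the garbage instead of skipping over it.
        System.out.println(keys(ANCHORED, garbage));   // prints [a]
        System.out.println(keys(UNANCHORED, garbage)); // prints [a, b]

        String glued = "key=\"value\"key=value";
        // (?:\s++|$) rejects the first pair because no whitespace follows it.
        System.out.println(keys(ANCHORED, glued));     // prints []
        System.out.println(keys(UNANCHORED, glued));   // prints [key]
    }
}
```

Without \G, find() silently skips whatever it cannot parse, so a caller could not tell a clean line from a damaged one.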

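The full sample code can be found below:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern KEY_VALUE = Pattern.compile("\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");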

public static Map<String, String> parseKeyValue(String kvString) {
    Matcher matcher = KEY_VALUE.matcher(kvString);

    Map<String, String> output = new HashMap<String, String>();
    int lastIndex = -1;

    while (matcher.find()) {
        output.put(matcher.group(1), matcher.group(2));
        lastIndex = matcher.end();
    }

    // Make sure that we match everything from the input string
    if (lastIndex != kvString.length()) {
        return null;
    }

    return output;
}

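You may want to unquote the values, depending on your requirements.

You can also rewrite the function to take a List of the keys you want to extract, and pick them out in the while loop so that you do not store keys you do not care about.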
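
A minimal sketch of those two suggestions, assuming you only need a few keys and that values are wrapped in plain double quotes. The names SelectiveKeyValueParser, parseSelected and unquote are hypothetical, and a Set is used instead of a List only to make the lookup cheap.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SelectiveKeyValueParser {
    private static final Pattern KEY_VALUE = Pattern.compile(
            "\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");

    // Hypothetical helper: strips one pair of surrounding double quotes, if present.
    static String unquote(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }

    // Returns only the wanted keys, or null if the input was not consumed completely.
    static Map<String, String> parseSelected(String kvString, Set<String> wantedKeys) {
        Matcher matcher = KEY_VALUE.matcher(kvString);
        Map<String, String> output = new HashMap<>();
        int lastIndex = -1;

        while (matcher.find()) {
            String key = matcher.group(1);
            if (wantedKeys.contains(key)) {
                output.put(key, unquote(matcher.group(2)));
            }
            lastIndex = matcher.end();
        }

        // Same completeness check as in parseKeyValue above.
        return lastIndex == kvString.length() ? output : null;
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<>(Arrays.asList("repo", "now", "status"));
        String line = "app=github repo=XYZ-ABCDE/abcdefg-abc "
                + "now=\"2015-09-24T13:44:52+00:00\" status=404";
        Map<String, String> fields = parseSelected(line, wanted);
        System.out.println(fields.get("repo"));
        System.out.println(fields.get("now"));
        System.out.println(fields.get("status"));
    }
}
```

Like the original function, this returns null for an empty string, because lastIndex starts at -1.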

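Modify the regex

The problem is that the outer repetition in (?:[^"]|"[^"]*")* is implemented with recursion, which leads to StackOverflowError when the input string is long enough.

Specifically, each repetition matches either a quoted token or a single non-quoted character. As a result, the stack grows linearly with the number of non-quoted characters and blows up.

You can replace all instances of (?:[^"]|"[^"]*")* with [^"]*(?:"[^"]*"[^"]*)*. The stack now grows linearly with the number of quoted tokens instead, so StackOverflowError will not occur unless the input string contains thousands of quoted tokens.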

Pattern KEY_CAPTURE = Pattern.compile("app=github(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+user=(?<user>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+repo=(?<repo>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+remote_address=(?<ip>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+now=\"(?<time>\\S+)\\+\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+url=\"(?<url>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+referer=\"(?<referer>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+status=(?<status>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+elapsed=(?<elapsed>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_method=(?<requestmethod>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+created_at=\"(?<createdat>\\S+)(?:-|\\+)\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+pull_request_id=(?<pullrequestid>\\d+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+at=(?<at>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+fn=(?<fn>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+method=(?<method>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+current_user=(?<user2>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+content_length=(?<contentlength>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_category=(?<requestcategory>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+controller=(?<controller>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+action=(?<action>\\S+))?");

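This follows the equivalent expansion of the regex (A|B)* → A*(BA*)*. Which alternative to use as A and which as B depends on how often each repeats: whichever repeats more should be A, and the other should be B.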
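
As a quick sanity check of that rewrite, the sketch below plugs both forms into a cut-down lookahead that handles only the repo key and compares what they capture. The class name UnrolledLoopDemo, the helper names and the sample lines are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledLoopDemo {
    // (A|B)* form from the question and the A*(BA*)* form from the answer.
    static final String ORIGINAL = "(?:[^\"]|\"[^\"]*\")*";
    static final String UNROLLED = "[^\"]*(?:\"[^\"]*\"[^\"]*)*";

    // Cut-down version of the full pattern with a single optional lookahead.
    static Pattern withSkipper(String skipper) {
        return Pattern.compile("app=github(?=" + skipper + "\\s+repo=(?<repo>\\S+))?.*");
    }

    // Returns the captured repo, or null if the line or the lookahead did not match.
    static String repo(String skipper, String line) {
        Matcher m = withSkipper(skipper).matcher(line);
        return m.matches() ? m.group("repo") : null;
    }

    public static void main(String[] args) {
        String line = "app=github note=\"x repo=fake\" repo=real status=404";
        System.out.println(repo(ORIGINAL, line)); // prints real
        System.out.println(repo(UNROLLED, line)); // prints real

        // Only a quoted occurrence: both forms refuse to look inside the quotes.
        String quotedOnly = "app=github note=\"x repo=fake\" status=404";
        System.out.println(repo(ORIGINAL, quotedOnly)); // prints null
        System.out.println(repo(UNROLLED, quotedOnly)); // prints null
    }
}
```

Both forms can stop at exactly the same positions (anywhere outside a quoted token), which is why the captures agree; the difference is only in how the engine gets there.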

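Deep dive into the implementation

StackOverflowError in Pattern is a known problem, which can occur when your pattern contains a repetition of a non-deterministic [1] capturing/non-capturing group, which in your case is the subpattern (?:[^"]|"[^"]*")*

[1] This is the terminology used in the source code of Pattern, and it is probably meant to indicate whether the pattern has a fixed length. However, the implementation considers the alternation | to be non-deterministic regardless of the actual pattern.

Greedy or lazy repetition of a non-deterministic capturing/non-capturing group is compiled into the Loop / LazyLoop classes, which implement the repetition by recursion. As a result, such a pattern is extremely prone to triggering StackOverflowError, especially when the group contains a branch that matches only a single character at a time.

On the other hand, deterministic [2] repetition, possessive repetition, and repetition of an independent group (?>...) (also known as an atomic group or non-backtracking group) are compiled into the Curly / GroupCurly classes, which in most cases process the repetition with a loop, so there will be no StackOverflowError.

[2] The repeated pattern is a character class, or a fixed-length capturing/non-capturing group without any alternation.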
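
If you want to observe the difference yourself, a rough probe along the following lines may help. It is only a sketch: the class name RepetitionStackProbe and the helper completes are invented here, the input length of 200,000 is arbitrary, and whether the first pattern actually overflows depends on your JVM version and thread stack size.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class RepetitionStackProbe {
    // Returns true if matching finished, false if the matcher ran out of stack.
    static boolean completes(String regex, String input) {
        try {
            Pattern.compile(regex).matcher(input).matches();
            return true;
        } catch (StackOverflowError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A long run of non-quote characters, the worst case described above.
        char[] chars = new char[200_000];
        Arrays.fill(chars, 'a');
        String input = new String(chars);

        // Greedy repetition of a group with alternation (recursive, per the answer).
        System.out.println("greedy alternation group:     "
                + completes("(?:[^\"]|\"[^\"]*\")*", input));
        // The same group with a possessive quantifier.
        System.out.println("possessive alternation group: "
                + completes("(?:[^\"]|\"[^\"]*\")*+", input));
        // Repetition of a plain character class.
        System.out.println("character class:              "
                + completes("[^\"]*", input));
    }
}
```

Note that the possessive form is shown only to compare stack usage. It is not a drop-in fix for the lookaheads in the question, because a possessive quantifier never gives characters back, so the \s+user=... part after it could no longer match.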

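Below you can see how a fragment of your original regex is compiled. Take note of the problematic part, which starts with Loop, and compare it with your stack trace.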

app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?
BnM. Boyer-Moore (BMP only version) (length=10)
  app=github
Ques. Greedy optional quantifier
  Pos. Positive look-ahead
    GroupHead. local=0
    Prolog. Loop wrapper
    Loop [1889ca51]. Greedy quantifier {0,2147483647}
      GroupHead. local=1
      Branch. Alternation (in printed order):
        CharProperty.complement. S̄:
          BitClass. Match any of these 1 character(s):
            "
        ---
        Single. Match code point: U+0022 QUOTATION MARK
        Curly. Greedy quantifier {0,2147483647}
          CharProperty.complement. S̄:
            BitClass. Match any of these 1 character(s):
              "
          Node. Accept match
        Single. Match code point: U+0022 QUOTATION MARK
        ---
      BranchConn [7e41986c]. Connect branches to sequel.
      GroupTail [47e1b36]. local=1, group=0. --[next]--> Loop [1889ca51]
    Curly. Greedy quantifier {1,2147483647}
      Ctype. POSIX (US-ASCII): SPACE
      Node. Accept match
    Slice. Match the following sequence (BMP only version) (length=5)
      user=
    GroupHead. local=3
    Curly. Greedy quantifier {1,2147483647}
      CharProperty.complement. S̄:
        Ctype. POSIX (US-ASCII): SPACE
      Node. Accept match
    GroupTail [732c7887]. local=3, group=2. --[next]--> GroupTail [6c9d2223]
    GroupTail [6c9d2223]. local=0, group=0. --[next]--> Node [4ea5d7f2]
    Node. Accept match
  Node. Accept match
Ques. Greedy optional quantifier
  Pos. Positive look-ahead
    GroupHead. local=4
    Prolog. Loop wrapper
    Loop [402c5f8a]. Greedy quantifier {0,2147483647}
      GroupHead. local=5
      Branch. Alternation (in printed order):
        CharProperty.complement. S̄:
          BitClass. Match any of these 1 character(s):
            "
        ---
        Single. Match code point: U+0022 QUOTATION MARK
        Curly. Greedy quantifier {0,2147483647}
          CharProperty.complement. S̄:
            BitClass. Match any of these 1 character(s):
              "
          Node. Accept match
        Single. Match code point: U+0022 QUOTATION MARK
        ---
      BranchConn [21347df0]. Connect branches to sequel.
      GroupTail [7d382897]. local=5, group=0. --[next]--> Loop [402c5f8a]
    Curly. Greedy quantifier {1,2147483647}
      Ctype. POSIX (US-ASCII): SPACE
      Node. Accept match
    Slice. Match the following sequence (BMP only version) (length=5)
      repo=
    GroupHead. local=7
    Curly. Greedy quantifier {1,2147483647}
      CharProperty.complement. S̄:
        Ctype. POSIX (US-ASCII): SPACE
      Node. Accept match
    GroupTail [71f111ba]. local=7, group=4. --[next]--> GroupTail [9c304c7]
    GroupTail [9c304c7]. local=4, group=0. --[next]--> Node [4ea5d7f2]
    Node. Accept match
  Node. Accept match
LastNode.
Node. Accept match

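Answer 1: (score: 3)

Final answer:

Move the (?:[^"]|"[^"]*")* functionality into the alternation group with the others.

Example: https://ideone.com/YuVcMg

It's unbreakable!

Side note - I noticed you said a misplaced newline left you with no delimiter between the end of one record and the start of the next, like this: request_category=apiapp=github

That is OK, but these regexes will mostly be defeated by it when they hit \S+

So it is best to replace \S+ with (?:(?!app=github)\S)+
This is not done in the regexes listed further down. Here is what it looks like with that added: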
"(?s)app=github(?>\\s+user=(?<user>(?:(?!app=github)\\S)+)|\\s+repo=(?<repo>(?:(?!app=github)\\S)+)|\\s+remote_address=(?<ip>(?:(?!app=github)\\S)+)|\\s+now=\\\\?\"(?<time>(?:(?!app=github)\\S)+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>(?:(?!app=github)\\S)+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>(?:(?!app=github)\\S)+)\\\\?\"|\\s+status=(?<status>(?:(?!app=github)\\S)+)|\\s+elapsed=(?<elapsed>(?:(?!app=github)\\S)+)|\\s+request_method=(?<requestmethod>(?:(?!app=github)\\S)+)|\\s+created_at=\\\\?\"(?<createdat>(?:(?!app=github)\\S)+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>(?:(?!app=github)\\S)+)|\\s+fn=(?<fn>(?:(?!app=github)\\S)+)|\\s+method=(?<method>(?:(?!app=github)\\S)+)|\\s+current_user=(?<user2>(?:(?!app=github)\\S)+)|\\s+content_length=(?<contentlength>(?:(?!app=github)\\S)+)|\\s+request_category=(?<requestcategory>(?:(?!app=github)\\S)+)|\\s+controller=(?<controller>(?:(?!app=github)\\S)+)|\\s+action=(?<action>(?:(?!app=github)\\S)+)|\"[^\"]*\"|(?!app=github).)+"
  

Link to an example that uses it: https://ideone.com/hdwufO
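
To show in isolation why the replacement matters, here is a small sketch with a single key. The class name TemperedTokenDemo, the helper category and the sample line are made up for this illustration; the two value sub-patterns are the ones discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemperedTokenDemo {
    // Hypothetical helper: extracts the request_category value using the given value sub-pattern.
    static String category(String valueRegex, String input) {
        Matcher m = Pattern.compile("request_category=(" + valueRegex + ")").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Two records glued together with no separator, as in the question.
        String glued = "status=404 request_category=apiapp=github env=production";

        // Plain \S+ runs straight into the next record.
        System.out.println(category("\\S+", glued));                   // prints apiapp=github
        // The lookahead stops the value right before the next "app=github".
        System.out.println(category("(?:(?!app=github)\\S)+", glued)); // prints api
    }
}
```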

Regex

Raw:

(?s)app=github(?>\s+user=(?<user>\S+)|\s+repo=(?<repo>\S+)|\s+remote_address=(?<ip>\S+)|\s+now=\\?"(?<time>\S+)\+\d\d:\d\d\\?"|\s+url=\\?"(?<url>\S+)\\?"|\s+referer=\\?"(?<referer>\S+)\\?"|\s+status=(?<status>\S+)|\s+elapsed=(?<elapsed>\S+)|\s+request_method=(?<requestmethod>\S+)|\s+created_at=\\?"(?<createdat>\S+)[-+]\d\d:\d\d\\?"|\s+pull_request_id=(?<pullrequestid>\d+)|\s+at=(?<at>\S+)|\s+fn=(?<fn>\S+)|\s+method=(?<method>\S+)|\s+current_user=(?<user2>\S+)|\s+content_length=(?<contentlength>\S+)|\s+request_category=(?<requestcategory>\S+)|\s+controller=(?<controller>\S+)|\s+action=(?<action>\S+)|"[^"]*"|(?!app=github).)+

Stringed:

"(?s)app=github(?>\\s+user=(?<user>\\S+)|\\s+repo=(?<repo>\\S+)|\\s+remote_address=(?<ip>\\S+)|\\s+now=\\\\?\"(?<time>\\S+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>\\S+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>\\S+)\\\\?\"|\\s+status=(?<status>\\S+)|\\s+elapsed=(?<elapsed>\\S+)|\\s+request_method=(?<requestmethod>\\S+)|\\s+created_at=\\\\?\"(?<createdat>\\S+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>\\S+)|\\s+fn=(?<fn>\\S+)|\\s+method=(?<method>\\S+)|\\s+current_user=(?<user2>\\S+)|\\s+content_length=(?<contentlength>\\S+)|\\s+request_category=(?<requestcategory>\\S+)|\\s+controller=(?<controller>\\S+)|\\s+action=(?<action>\\S+)|\"[^\"]*\"|(?!app=github).)+"

Formatted:

 (?s)
 app = github
 (?>
      \s+ 
      user =
      (?<user> \S+ )                # (1)
   |  
      \s+  repo =
      (?<repo> \S+ )                # (2)
   |  
      \s+ remote_address =
      (?<ip> \S+ )                  # (3)
   |  
      \s+ now= \\? "
      (?<time> \S+ )                # (4)
      \+ \d\d : \d\d \\? "
   |  
      \s+ url = \\? "
      (?<url> \S+ )                 # (5)
      \\? "
   |  
      \s+ referer = \\? "
      (?<referer> \S+ )             # (6)
      \\? "
   |  
      \s+ status =
      (?<status> \S+ )              # (7)
   |  
      \s+ elapsed =
      (?<elapsed> \S+ )             # (8)
   |  
      \s+ request_method =
      (?<requestmethod> \S+ )       # (9)
   |  
      \s+ created_at = \\? "
      (?<createdat> \S+ )           # (10)
      [-+] 
      \d\d : \d\d \\? "
   |  
      \s+ pull_request_id =
      (?<pullrequestid> \d+ )       # (11)
   |  
      \s+ at=
      (?<at> \S+ )                  # (12)
   |  
      \s+ fn=
      (?<fn> \S+ )                  # (13)
   |  
      \s+ method =
      (?<method> \S+ )              # (14)
   |  
      \s+ current_user =
      (?<user2> \S+ )               # (15)
   |  
      \s+ content_length =
      (?<contentlength> \S+ )       # (16)
   |  
      \s+ request_category =
      (?<requestcategory> \S+ )     # (17)
   |  
      \s+ controller =
      (?<controller> \S+ )          # (18)
   |  
      \s+ action =
      (?<action> \S+ )              # (19)
   |  
      " [^"]* "                     # None of the above, give quotes a chance
   |  
      (?! app = github )            # Failsafe, consume a character, advance by 1
      . 
 )+
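
For readers who want to try the idea on something smaller, here is a sketch that keeps only the user and repo alternatives of the regex above and iterates over records with find(). The class name AlternationGroupDemo, the helper records and the sample input are invented for illustration; the real pattern has many more alternatives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationGroupDemo {
    // Reduced version of the answer's regex: two keys, quoted tokens, and the failsafe.
    static final Pattern RECORD = Pattern.compile(
            "(?s)app=github(?>"
            + "\\s+user=(?<user>\\S+)"
            + "|\\s+repo=(?<repo>\\S+)"
            + "|\"[^\"]*\""            // skip quoted tokens as a whole
            + "|(?!app=github)."       // failsafe: advance one character
            + ")+");

    // Hypothetical helper: one summary string per record found in the input.
    static List<String> records(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            result.add("user=" + m.group("user") + ",repo=" + m.group("repo"));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "app=github status=404 repo=r1 note=\"x user=fake\" user=u1 "
                + "app=github user=u2";
        for (String record : records(input)) {
            System.out.println(record);
        }
        // prints:
        // user=u1,repo=r1
        // user=u2,repo=null
    }
}
```

Each find() call stops right before the next app=github, so keys can appear in any order within a record and a key that is absent simply leaves its group null.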