Question

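I have read the articles and other questions about catastrophic backtracking in regular expressions and how it can be caused by nested + and * quantifiers. However, my regex still runs into catastrophic backtracking without any nested quantifiers. Can someone help me understand why?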

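I wrote these regexes to search for a specific type of rhyme in Welsh poetry. The rhyme consists of all the consonants at the start of the line being repeated at the end of the line, and there must be a space between the starting and ending consonants. I have already removed all the vowels, but two exceptions make these regexes ugly. First, consonants in the middle are allowed to go unrepeated; if there are any, it is a different type of rhyme. Second, the letters m, n, r, h, and v are allowed to interrupt the rhyme (appearing at the start but not at the end, or vice versa), but they cannot simply be ignored, because sometimes the rhyme consists only of those letters.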

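My script automatically builds a regex for each line and tests it. It works the rest of the time, but this one line causes catastrophic backtracking. The text of the line, with the vowels removed, is: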

nn  Frvvn  Frv v

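The regex automatically found that nn Frvvn rhymes with Frv v, so it then tries again with the last letter (the n in Frvvn) required. If that letter is not required, the rhyme can be shortened. Here is the regex: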

^(?P<s_letters>         # starting letters
[mnrhv]*?\s*n{0,2}      # any number of optional letters or any number
                        # of spaces can come between rhyming letters
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*F{1,2}
[mnrhv]*?\s*[rR]?(?:\s*[rR])? # r can also rhyme with R, but that's
                              # not relevant here (I think)
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)
(?P<m_letters>          # middle letters
[^\s]*?(?P<caesura>\s)  # the caesura (end of the rhyme) is the
                        # first space after the rhyme     
.*)                     # End letters come as late as possible
(?P<e_letters>          # End group
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*n{0,2}
[mnrhv]*?\s*F{1,2}
[mnrhv]*?\s*[rR]?(?:\s*[rR])?
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)$

Even though it has no nested quantifiers, it still takes forever to run. Regexes generated the same way for other lines run quickly. Why is this?

Answer 1

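I don't see any nested quantifiers, but I do see a lot of ambiguity that will cause a high-exponent polynomial runtime. For example, consider this part of the regex: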

[mnrhv]*?\s*[rR]?(?:\s*[rR])? # r can also rhyme with R, but that's
                              # not relevant here (I think)
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*v{0,2}
[mnrhv]*?\s*n{1,2}
[mnrhv\s]*?)
(?P<m_letters>          # middle letters
[^\s]*?(?P<caesura>\s)  # the caesura (end of the rhyme) is the

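Suppose the regex engine is at this point, and the text it is looking at is just a big block of n's. Those n's could be divided up among the following parts of the regex: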

[mnrhv]*?\s*[rR]?(?:\s*[rR])?
^^^^^^^^^

[mnrhv]*?\s*v{0,2}
^^^^^^^^^

[mnrhv]*?\s*v{0,2}
^^^^^^^^^

[mnrhv]*?\s*n{1,2}
^^^^^^^^^   ^^^^^^
[mnrhv\s]*?)
^^^^^^^^^^^
(?P<m_letters>
[^\s]*?(?P<caesura>\s)
^^^^^^^

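If the number of n's is N, then there are O(N**6) ways to divide up the n's, because there are six *? blocks that match n here, and everything in between is either optional or also matches n.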
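To get a feel for that growth, here is a rough Python sketch that counts the splits under a simplified model: six unbounded blocks each take zero or more of the N n's in order, and whatever is left over goes to the rest of the pattern. The function names are made up for illustration, and a real engine's step count will differ from this by constant factors.

```python
from itertools import product
from math import comb


def count_splits_brute(n, blocks=6):
    """Enumerate every way `blocks` consecutive quantifiers can share up to n letters."""
    return sum(
        1
        for sizes in product(range(n + 1), repeat=blocks)
        if sum(sizes) <= n
    )


def count_splits(n, blocks=6):
    """Closed form for the same count (stars and bars): C(n + blocks, blocks)."""
    return comb(n + blocks, blocks)


if __name__ == "__main__":
    # The closed form is a degree-6 polynomial in n when blocks == 6.
    for n in (5, 10, 20, 40):
        print(n, count_splits(n))
```

Under this model, the ratio between the counts for 2N and N approaches 2**6 = 64 as N grows, so even a modest run of ambiguous letters becomes expensive when the overall match has to fail.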

Are those \s parts mandatory? If so, you may be able to improve the runtime by putting a + on them instead of a *.
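As a hedged illustration of that suggestion, the sketch below uses a reduced pattern, not the full regex from the question: six copies of one block followed by a letter the test text never contains, so the match must fail. The names BLOCK_STAR, BLOCK_PLUS, build, and time_failure are invented for this example, and switching to \s+ is only valid if whitespace really is required at those positions.

```python
import re
import time

BLOCK_STAR = r"[mnrhv]*?\s*"  # whitespace optional: adjacent blocks can trade letters
BLOCK_PLUS = r"[mnrhv]*?\s+"  # whitespace required: a run of letters stays in one block


def build(block, repeats=6):
    """Reduced, hypothetical pattern: `repeats` copies of `block`, then a final F."""
    return re.compile("^" + block * repeats + "F$")


def time_failure(pattern, text):
    """Return the match result and how long the attempt took, in seconds."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


# A run of n's with no F at the end, so both patterns must give up eventually.
text = "n" * 25
for label, block in ((r"\s*", BLOCK_STAR), (r"\s+", BLOCK_PLUS)):
    result, elapsed = time_failure(build(block), text)
    print(f"{label}: match={result!r} elapsed={elapsed:.4f}s")
```

With \s* the engine has to try every way of sharing the n's among the six blocks before it can report failure. With \s+ the first block cannot hand letters to the next one without an intervening space, so the failure is found after a linear scan.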
