我有一个字符串,对它运行一个正则表达式,但出现错误。
PHP:
<?php
$str = 'modified: apps/aaaaaa/bbbbbb/cccc
modified: apps/ami (new commits)
modified: apps/assess (new commits)
modified: apps/ees121 (new commits)
modified: apps/energy_bourse (new commits)
modified: apps/gis (new commits)
modified: apps/hse (new commits)
modified: apps/aa/aaa/a/bb/b/bb/bc/c22/s/df/s/
modified: apps/management (new commits)
modified: apps/payesh (new commits)
modified: external_apps (modified content)
modified: modules/esb_server (new commits)
modified: modules/formbuilder (new commits, modified content)
modified: modules/reporting (new commits)
modified: modules/safir (new commits)
modified: modules/workflow (new commits)
modified: vendor/raya_framework/client (new commits)
modified: vendor/raya_framework/core (new commits)';
preg_match_all("/modified:\s+((\w+\/?)+).*\)/", $str, $matches);
var_dump($matches);
这很好用,但是如果我在其中的一行中添加几个字符,则没有匹配项。例如:
modified: apps/aaaaaa/bbbbbb/ccccfff
这似乎取决于文字字符,而不是正斜杠。为什么有些字符在这里有所作为?那我该怎么办?
答案 0 :(得分:2)
那些多余的字符使正则表达式引擎达到回溯步骤的限制:
var_dump(preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR); // will return `true`
您的正则表达式几乎很短,看起来似乎正确,但实际上,它滥用了量词,并导致Catastrophic Backtracking在失败时发生。当它在)
模式序列的末尾与\w+/?
不匹配时,它将尝试回溯到所有以前的子表达式中,以期找到一个)
。但是它永远不会发生,嵌套的量化组和令牌使此过程像永远运行一样。
该解决方案正在重新构建您的正则表达式以考虑此问题:
modified:\s+((?>\w+\/?)+).*\)
我刚刚将第二个捕获组设为原子组。原子组(顾名思义)不允许回溯到群集中。因此,如果在匹配\w+\/?
后找不到模式,则它永远不会回退到\w+\/?
中,这会导致早期故障。
对此正则表达式的正确修改是将.*
替换为更具限制性的内容:
modified:\s+((?>\w+\/?)+)[^)\v]*\)
PHP代码:
preg_match_all('~modified:\s+((?>\w+/?)+)[^)\v]*\)~', $str, $matches);