Question

I ran the following test in Tcl:

set v z[string repeat t 10000]b[string repeat t 10000]g[string repeat t 10000]z

If I use regexp only to test whether the pattern matches, it is fine:

time {regexp {z.*?b.*?g(.+?)z} $v} 20
[TCL_OK] 340.4 microseconds per iteration

But if I want to get the submatch, regexp becomes very slow:

time {regexp {z.*?b.*?g(.+?)z} $v -> asd} 5
[TCL_OK] 157007.4 microseconds per iteration

What is wrong with my regular expression, and why is regexp so slow only when it has to return submatches?

I am using the following environment:

parray tcl_platform
tcl_platform(byteOrder)     = littleEndian
tcl_platform(machine)       = intel
tcl_platform(os)            = Windows NT
tcl_platform(osVersion)     = 6.1
tcl_platform(pathSeparator) = ;
tcl_platform(platform)      = windows
tcl_platform(pointerSize)   = 4
tcl_platform(threaded)      = 1
tcl_platform(user)          = kot
tcl_platform(wordSize)      = 4
[TCL_OK]
puts $tcl_patchLevel
8.6.0
[TCL_OK]

Update. Additional tests:

Match without capturing - best time:

time {regexp {z.*?b.*?g(.+?)z} $v} 5
[TCL_OK] 1178.2 microseconds per iteration

Match with capture, non-greedy - bad time:

time {regexp {z.*?b.*?g(.+?)z} $v -> asd} 5
[TCL_OK] 13796072.4 microseconds per iteration

Match with capture, greedy - good time:

time {regexp {z.*b.*g(.+)z} $v -> asd} 5
[TCL_OK] 7097.4 microseconds per iteration
string length $asd
[TCL_OK] 100007

Match with capture, non-greedy + greedy + greedy - very bad time:

time {regexp {z.*?b.*g(.+)z} $v -> asd} 5
[TCL_OK] 38177041.6 microseconds per iteration
string length $asd
[TCL_OK] 100000

Finally, match with capture, non-greedy + non-greedy + greedy - the match behaves as non-greedy, and the time is acceptable:

time {regexp {z.*?b.*?g(.+)z} $v -> asd} 5
[TCL_OK] 4157.0 microseconds per iteration
string length $asd
[TCL_OK] 100000

The behaviour of Tcl's RE engine is very unpredictable to me.

Answer 1

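Non-capturing parentheses are faster than capturing ones (because they allow a more optimized compilation strategy), so Tcl's RE engine uses the non-capturing form internally whenever it can. When can it? When the regular expression contains no backreferences (\1) and the captured substrings are not used from outside (information that Tcl passes in). By adding the extra arguments that capture the substring, you force the RE compiler down a less efficient path (though of course you get more information back).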
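To see the difference between the two call forms on your own machine, a minimal timing sketch along the following lines may help. It reuses the subject string and pattern from the question; the variable names subject, re and sub are only illustrative, and the absolute timings will depend on the Tcl build and the hardware.

```tcl
# Same subject string and pattern as in the question.
set subject z[string repeat t 10000]b[string repeat t 10000]g[string repeat t 10000]z
set re {z.*?b.*?g(.+?)z}

# Match only: no match variables are supplied, so no submatch has to be reported.
puts "match only:   [time {regexp $re $subject} 20]"

# Match plus capture: "->" receives the whole match, "sub" the first group.
puts "with capture: [time {regexp $re $subject -> sub} 2]"
puts "captured [string length $sub] characters"
```

The only difference between the two timed commands is the presence of the match variables, so any gap in the reported times comes from asking for the submatch.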

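[EDIT] It turns out that non-greedy REs perform poorly with capturing parentheses in Tcl's current RE engine. (I don't know why; the code is rather complicated. OK, very complicated!) But this particular regular expression can be written in a way that matches quickly.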

First, a baseline timing on my machine:

% time {regexp {z.*?b.*?g(.+?)z} $v} 2000
98.98675999999999 microseconds per iteration

For comparison, here is a greedy version without parens (slightly faster, but not by much):

% time {regexp {z.*b.*g.+z} $v} 2000
96.954045 microseconds per iteration

Next, the original slow match:

% time {regexp {z.*?b.*?g(.+?)z} $v -> asd} 50
163337.53884 microseconds per iteration
% string length $asd
10000

Now, a faster version:

% time {regexp {z[^b]*b[^g]*g([^z]*)z} $v -> asd} 5000
341.0937716 microseconds per iteration
% string length $asd
10000

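This uses greedy matching, but reduces backtracking by replacing (for example) .*?b with [^b]*b. Note that you can still see the cost of capturing, but at least this runs quickly and captures the same range of characters.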
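As a quick sanity check that the rewritten pattern captures the same text as the original for this input, something like the following could be run. The helper name same_capture is made up for illustration, and the check only says something about the subject string it is given, not about the two patterns in general.

```tcl
set v z[string repeat t 10000]b[string repeat t 10000]g[string repeat t 10000]z

# Hypothetical helper: returns 1 if both patterns match the subject
# and their first capture groups hold the same text, 0 otherwise.
proc same_capture {reA reB subject} {
    set okA [regexp $reA $subject -> a]
    set okB [regexp $reB $subject -> b]
    expr {$okA && $okB && $a eq $b}
}

puts [same_capture {z.*?b.*?g(.+?)z} {z[^b]*b[^g]*g([^z]*)z} $v]
```

One difference to keep in mind: ([^z]*) also accepts an empty capture, whereas (.+?) requires at least one character. If that matters, ([^z]+) is the closer equivalent.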

I guess the matching you experienced was so slow because the engine was backtracking a lot.

Regular expression matching in Tcl is too slow when using submatches
