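I have been seeing extremely slow execution times with expressions that use several lookaheads.
I assume this comes from the underlying data structures, but it looks so extreme that I wonder whether I am doing something wrong or whether there is a known workaround.
The problem is to determine whether a set of words is present in a string, in any order. For example, we want to find out whether the two terms "term1" and "term2" occur somewhere in the string. I express this as: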
(?=.*\bterm1\b)(?=.*\bterm2\b)
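But what I observe is that this is an order of magnitude slower than first checking just
\bterm1\b
and then
\bterm2\b
This seems to suggest that I should use an array of patterns rather than a single pattern with lookaheads. Is that right? It seems wrong.
Here is a sample test and the resulting timings: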
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void speedLookAhead() {
    Matcher m, m1, m2;
    boolean find;
    int its = 1000000;
    // create long non-matching string
    char[] str = new char[2000];
    for (int i = 0; i < str.length; i++) {
        str[i] = 'x';
    }
    // new String(str) builds the 2000-character string. The original post
    // called str.toString(), which returns the array's identity string
    // (something like "[C@1b6d3586"), so the timings quoted below were
    // presumably measured on that short string. With the real 2000-character
    // input, "its" may need to be lowered for the first loop to finish quickly.
    String test = new String(str);
    // First method: use one expression with lookaheads
    m = Pattern.compile("(?=.*\\bterm1\\b)(?=.*\\bterm2\\b)").matcher(test);
    long time = System.currentTimeMillis();
    for (int i = 0; i < its; i++) {
        m.reset(test);
        find = m.find();
    }
    time = System.currentTimeMillis() - time;
    System.out.println(time);
    // Second method: use two expressions and AND the results
    m1 = Pattern.compile("\\bterm1\\b").matcher(test);
    m2 = Pattern.compile("\\bterm2\\b").matcher(test);
    time = System.currentTimeMillis();
    for (int i = 0; i < its; i++) {
        m1.reset(test);
        m2.reset(test);
        find = m1.find() && m2.find();
    }
    time = System.currentTimeMillis() - time;
    System.out.println(time);
}
On my machine this prints (times in milliseconds):
1754
150
Answer 0 (score: 2)
This might shave off some time.
Greedy
([AB]).*(?!\1)[AB]
Non-greedy
([AB]).*?(?!\1)[AB]
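As a rough illustration in Java (the language used in the question), the two patterns above could be tried as follows. The class name `OrderFreePair`, the helper `hasBoth`, and the sample strings are made up for this sketch; `A` and `B` stand in for the two terms.

```java
import java.util.regex.Pattern;

final class OrderFreePair {
    // Group 1 captures whichever of A/B comes first; (?!\1) then insists
    // that the second hit is the other letter.
    static final Pattern GREEDY = Pattern.compile("([AB]).*(?!\\1)[AB]");
    static final Pattern LAZY = Pattern.compile("([AB]).*?(?!\\1)[AB]");

    static boolean hasBoth(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {"xxAxxBxx", "xxBxxAxx", "xxAxxAxx", "xxxx"};
        for (String s : samples) {
            System.out.println(s + " greedy=" + hasBoth(GREEDY, s)
                    + " lazy=" + hasBoth(LAZY, s));
        }
    }
}
```

Both variants should agree on whether a string contains the pair; which one is faster likely depends on how far apart the two terms are.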
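Redo
I have done some work of my own on this problem. Matching a single term at a time, as in /term/, always takes less time than matching two terms in one regex, because it involves no backtracking. It is about as simple as strncmp(term). So doing the two terms separately is faster.
If you can define the terms so that there is no possibility of overlap, then that is the way to go, i.e. /term1/ && /term2/.
There is no way to combine the terms into a single regex without invoking backtracking.
That said, if you really care about overlap, there are techniques that minimize backtracking.
/(?=.*A)(?=.*B)/ is just like /A/ && /B/, except that it appears to be a lot slower; neither accounts for overlap.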
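To make the overlap point concrete, here is a small Java sketch using the terms `term` and `m2a` from the benchmark in this answer. The class and method names are invented for the illustration. In the string `term2ater` the two terms share the letter `m`, yet both approaches report a hit.

```java
import java.util.regex.Pattern;

final class OverlapBlind {
    static final Pattern LOOKAHEADS = Pattern.compile("(?=.*term)(?=.*m2a)");
    static final Pattern TERM_A = Pattern.compile("term");
    static final Pattern TERM_B = Pattern.compile("m2a");

    // Single regex with two lookaheads.
    static boolean viaLookaheads(String s) {
        return LOOKAHEADS.matcher(s).find();
    }

    // Two independent searches, ANDed together.
    static boolean viaTwoFinds(String s) {
        return TERM_A.matcher(s).find() && TERM_B.matcher(s).find();
    }

    public static void main(String[] args) {
        String overlapping = "xaaaaaaa term2ater"; // "term" and "m2a" share the 'm'
        System.out.println(viaLookaheads(overlapping)); // true
        System.out.println(viaTwoFinds(overlapping));   // true
    }
}
```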
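So, if you really care about overlap (and I strongly suggest you do), there are two approaches that can be combined for the best efficiency:
/(A|B).*(?!\1)(?:A|B)/
or
/A/ && /B/ && /(A|B).*(?!\1)(?:A|B)/
The last one adds a small (relative) overhead, but it can short-circuit the logic chain by requiring that A and B at least both exist before the overlap check runs.
And, depending on where A and B sit in the string, /(A|B).*(?!\1)(?:A|B)/ can also take a while, but on average it is still the shortest path overall.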
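A minimal Java sketch of the combined check, again with the benchmark's terms `term` and `m2a` hard-coded. The class name `NonOverlappingTerms` and the method `containsBoth` are assumptions made for this example, not part of any library.

```java
import java.util.regex.Pattern;

final class NonOverlappingTerms {
    static final Pattern A = Pattern.compile("term");
    static final Pattern B = Pattern.compile("m2a");
    // First term found is captured; the second must be the other term
    // and must start after the first one ends.
    static final Pattern AB = Pattern.compile("(term|m2a).*(?!\\1)(?:term|m2a)");

    static boolean containsBoth(String s) {
        // The two cheap searches run first; && skips the costlier
        // overlap-aware regex when either term is absent.
        return A.matcher(s).find() && B.matcher(s).find() && AB.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsBoth("xaaaaaaa term2ater"));  // false: terms overlap
        System.out.println(containsBoth("xaaaaaaa term2aterm")); // true: m2a, then term
    }
}
```

The ordering of the three tests matters only for speed; the boolean result is decided by the last pattern whenever the first two succeed.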
Below is a Perl program that benchmarks some sample strings (possible scenarios).
Good luck!
use strict;
use warnings;
use Benchmark ':hireswallclock';

my ($t0, $t1);
my ($term1, $term2) = ('term', 'm2a');
my @samples = (
    ' xaaaaaaa term2ater ',
    ' xaaaaaaa term2aterm ',
    ' xaaaaaaa ter2ater ',
    ' Aaa term2ater ' . 'x ' x 100 . 'xaaaaaaa mta ',
    ' Baa term ' . 'x ' x 100 . 'xaaaaaaa mta ',
    ' Caa m2a ' . 'x ' x 100 . 'xaaaaaaa term ',
    ' Daa term2a ' . 'x ' x 100 . 'xaaaaaaa term ',
);
my $rxA  = qr/$term1/;
my $rxB  = qr/$term2/;
my $rxAB = qr/ ($term1|$term2) .* (?!\1)(?:$term1|$term2) /x;

for (@samples)
{
    printf "Checking string: '%.40s'\n-------------\n", $_;
    if (/$term1/ && /$term2/ ) {
        print " Found possible candidates (A && B)\n";
    }
    if (/ ($term1|$term2) .* ((?!\1)(?:$term1|$term2)) /x) {
        print " Found non-overlaped terms: '$1' '$2'\n";
    }
    else {
        print " No (A|B) .* (?!\\1)(A|B) terms found!\n";
    }
    print "\n Bench\n";

    # Two separate single-term matches
    $t0 = Benchmark->new;
    for my $cnt (1 .. 500_000) {
        /$rxA/ && /$rxB/;
    }
    $t1 = Benchmark->new;
    print " $rxA && $rxB\n -took: ", timestr(timediff($t1, $t0)), "\n\n";

    # Single overlap-aware regex
    $t0 = Benchmark->new;
    for my $cnt (1 .. 500_000) {
        /$rxAB/;
    }
    $t1 = Benchmark->new;
    print " $rxAB\n -took: ", timestr(timediff($t1, $t0)), "\n\n";

    # Both cheap matches first, then the overlap-aware regex
    $t0 = Benchmark->new;
    for my $cnt (1 .. 500_000) {
        /$rxA/ && /$rxB/ && /$rxAB/;
    }
    $t1 = Benchmark->new;
    print " $rxA && $rxB &&\n $rxAB\n -took: ", timestr(timediff($t1, $t0)), "\n\n";
}
Output
Checking string: ' xaaaaaaa term2ater '
-------------
Found possible candidates (A && B)
No (A|B) .* (?!\1)(A|B) terms found!
Bench
(?-xism:term) && (?-xism:m2a)
-took: 1.46875 wallclock secs ( 1.47 usr + 0.00 sys = 1.47 CPU)
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 3.3748 wallclock secs ( 3.34 usr + 0.00 sys = 3.34 CPU)
(?-xism:term) && (?-xism:m2a) &&
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 5.0623 wallclock secs ( 5.06 usr + 0.00 sys = 5.06 CPU)
Checking string: ' xaaaaaaa term2aterm '
-------------
Found possible candidates (A && B)
Found non-overlaped terms: 'm2a' 'term'
Bench
(?-xism:term) && (?-xism:m2a)
-took: 1.48403 wallclock secs ( 1.49 usr + 0.00 sys = 1.49 CPU)
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 3.89044 wallclock secs ( 3.89 usr + 0.00 sys = 3.89 CPU)
(?-xism:term) && (?-xism:m2a) &&
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 5.40607 wallclock secs ( 5.38 usr + 0.00 sys = 5.38 CPU)
Checking string: ' xaaaaaaa ter2ater '
-------------
No (A|B) .* (?!\1)(A|B) terms found!
Bench
(?-xism:term) && (?-xism:m2a)
-took: 0.765321 wallclock secs ( 0.77 usr + 0.00 sys = 0.77 CPU)
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 1.29674 wallclock secs ( 1.30 usr + 0.00 sys = 1.30 CPU)
(?-xism:term) && (?-xism:m2a) &&
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 0.874842 wallclock secs ( 0.88 usr + 0.00 sys = 0.88 CPU)
Checking string: ' Aaa term2ater x x x x x x x x x x x x x'
-------------
Found possible candidates (A && B)
No (A|B) .* (?!\1)(A|B) terms found!
Bench
(?-xism:term) && (?-xism:m2a)
-took: 1.46842 wallclock secs ( 1.47 usr + 0.00 sys = 1.47 CPU)
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 28.078 wallclock secs (28.08 usr + 0.00 sys = 28.08 CPU)
(?-xism:term) && (?-xism:m2a) &&
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 29.4531 wallclock secs (29.45 usr + 0.00 sys = 29.45 CPU)
Checking string: ' Baa term x x x x x x x x x x x x x'
-------------
No (A|B) .* (?!\1)(A|B) terms found!
Bench
(?-xism:term) && (?-xism:m2a)
-took: 1.68716 wallclock secs ( 1.69 usr + 0.00 sys = 1.69 CPU)
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 15.1563 wallclock secs (15.16 usr + 0.00 sys = 15.16 CPU)
(?-xism:term) && (?-xism:m2a) &&
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 1.64033 wallclock secs ( 1.64 usr + 0.00 sys = 1.64 CPU)
Checking string: ' Caa m2a x x x x x x x x x x x x x'
-------------
Found possible candidates (A && B)
Found non-overlaped terms: 'm2a' 'term'
Bench
(?-xism:term) && (?-xism:m2a)
-took: 1.62448 wallclock secs ( 1.63 usr + 0.00 sys = 1.63 CPU)
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 3.0154 wallclock secs ( 3.02 usr + 0.00 sys = 3.02 CPU)
(?-xism:term) && (?-xism:m2a) &&
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 4.56226 wallclock secs ( 4.56 usr + 0.00 sys = 4.56 CPU)
Checking string: ' Daa term2a x x x x x x x x x x x '
-------------
Found possible candidates (A && B)
Found non-overlaped terms: 'm2a' 'term'
Bench
(?-xism:term) && (?-xism:m2a)
-took: 1.45252 wallclock secs ( 1.45 usr + 0.00 sys = 1.45 CPU)
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 16.1404 wallclock secs (16.14 usr + 0.00 sys = 16.14 CPU)
(?-xism:term) && (?-xism:m2a) &&
(?x-ism: (term|m2a) .* (?!\1)(?:term|m2a) )
-took: 17.6719 wallclock secs (17.67 usr + 0.00 sys = 17.67 CPU)
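Answer 1 (score: 1)
You need to put each loop in a separate method; if you swap the order of the tests, you will get different results.
You could compare it with test.indexOf('A') >= 0 && test.indexOf('B') >= 0, as I would expect that to be faster.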
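One way this suggestion might look in Java is sketched below. The class `IndexOfCheck`, its methods, and the iteration counts are all hypothetical; the terms are passed as strings rather than single characters. Note that `indexOf` does a plain substring search, so unlike `\bterm1\b` it does not check word boundaries.

```java
import java.util.Arrays;

final class IndexOfCheck {
    // Written after each timed loop so the loop's result is actually used.
    static int sink;

    static boolean containsBoth(String text, String a, String b) {
        return text.indexOf(a) >= 0 && text.indexOf(b) >= 0;
    }

    // The timed loop lives in its own method, as suggested above.
    static long timeIndexOf(String text, int iterations) {
        long start = System.nanoTime();
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            if (containsBoth(text, "term1", "term2")) {
                hits++;
            }
        }
        sink = hits;
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        char[] chars = new char[2000];
        Arrays.fill(chars, 'x');
        String test = new String(chars);
        timeIndexOf(test, 100_000); // warm-up pass, result discarded
        long nanos = timeIndexOf(test, 100_000);
        System.out.println("indexOf: " + nanos / 1_000_000 + " ms");
    }
}
```

The regex variants from the question could each get a similar method of their own, so that every loop is measured in isolation and after a warm-up pass.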
Answer 2 (score: 1)
The regex you posted
(?=.\A\b)(?=.\B\b)
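does not match the one in your code:
.(?=.*B)(?=.*A)
In fact, the first regex does not appear to be able to match anything.
Can you give some examples of what should match and what should not?
Here is an explanation of the regex from the code.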
Match any single character that is not a line break character «.»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*B)»
Match any single character that is not a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “B” literally «B»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*A)»
Match any single character that is not a line break character «.*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “A” literally «A»
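The effect of the leading `.` in that breakdown can be seen in a short Java sketch. The class `LeadingDot` and the sample strings are invented for the illustration. Because the dot consumes one character first, both lookaheads only inspect the text after that character.

```java
import java.util.regex.Pattern;

final class LeadingDot {
    static final Pattern WITH_DOT = Pattern.compile(".(?=.*B)(?=.*A)");
    static final Pattern WITHOUT_DOT = Pattern.compile("(?=.*B)(?=.*A)");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        // In "AB" the dot must consume 'A' or 'B', so one of them is
        // never visible to the lookaheads that follow.
        System.out.println(matches(WITH_DOT, "AB"));     // false
        System.out.println(matches(WITHOUT_DOT, "AB"));  // true
        System.out.println(matches(WITH_DOT, "xAB"));    // true
    }
}
```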