我想用这个例子来理解我想要实现的目标会更容易:
假设我们有白名单:one two three
。这个黑名单:four five
。然后:
three one two
是匹配的文字(包含所有白名单字词); one three two six
是匹配的文字(包含所有白名单字词); two one
不是匹配的文字(缺少白名单字three
); one four two three
不是匹配的文字(包含黑名单字four
)。有没有人可以帮我解决这个案例的正则表达式?
答案 0 :(得分:9)
这不是你想要使用正则表达式的东西。最好这样做(Python中的例子):
>>> whitelist = ["one", "two", "three"]
>>> blacklist = ["four", "five"]
>>> texts = ["three two one", "one three two six", "one two", "one two three four"]
>>> for text in texts:
... mytext = text.split()
... if all(word in mytext for word in whitelist) and \
... not any(word in mytext for word in blacklist):
... print(text)
...
three two one
one three two six
>>>
你可以做到这一点,但是:
^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)(?!.*\bfour\b)(?!.*\bfive\b)
^
将搜索锚定在字符串的开头。(?=...)
确保其内容可以与当前位置匹配(?!...)
确保其内容无法与当前位置匹配\bone\b
匹配one
但不匹配lonely
。所以你得到:
>>> import re
>>> r = re.compile(r"^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)(?!.*\bfour\b)(?!.*\bfive\b)")
>>> for text in texts:
... if r.match(text):
... print(text)
...
three two one
one three two six
答案 1 :(得分:3)
对于一个正则表达式来说,我的问题解决方案需要太多麻烦,因此我认为这不是一个好的(或事实上,在更仔细地查看之后)。计数和设置删除不是你想要构建到单个正则表达式中的。使用辅助逻辑。见蒂姆的解决方案。
使用Perl,选择工具:)进行这类工作:
@blacklist = qw(four five);
$blacklist = do { local $" = "|"; qr/@blacklist/ };
@whitelist = qw(one two three);
$whitelist = do { local $" = "|"; qr/@whitelist/ };
if ($string =~ /\b$whitelist\b/ && $string !~ /\b$blacklist\b/) { ... }
Perl会将这些编译成trie数据结构,因此无论你有多少替代方案,它都会非常快速地执行它们。在许多实际应用中,这种O(1)优化是非常重要的。
我误解了这个问题,因为想要一些白名单而没有黑名单,而不是想要所有白名单而不是黑名单。这不是我所谓的白名单。也许是一个想要列表和一个不想要的列表。
无论如何,这里有一个全有和无意义的更新,虽然当然为了提高性能,你可以改变顺序,让它变得一无所有。 (这假设单词不重叠;我可以执行重叠的情况,但没有它就更容易理解。)
#!/usr/bin/env perl
use strict;
use warnings;
my @blacklist = qw(four five);
my $blacklist = do { local $" = "|"; qr/@blacklist/ };
my @whitelist = qw(one two three);
my $whitelist = do { local $" = "|"; qr/@whitelist/ };
while (<DATA>) {
s/\s*#\s*(.*)$// && print "This $1\n\t";
if (/$blacklist/ == 0 && @whitelist == (() = /\b$whitelist\b/g)) {
print "GOOD: ";
} else {
print "EVIL: ";
}
print;
}
__END__
three one two # is a matching text (contains all whitelist words)
one three two six # is a matching text (contains all whitelist words)
two one # is not a matching text (lacks a whitelist word three)
one four two three # is not a matching text (contains a blacklist word four)
运行时报告:
This is a matching text (contains all whitelist words)
GOOD: three one two
This is a matching text (contains all whitelist words)
GOOD: one three two six
This is not a matching text (lacks a whitelist word three)
EVIL: two one
This is not a matching text (contains a blacklist word four)
EVIL: one four two three
要调试正则表达式,包括查看两个trie数据结构如何编译和执行,只需在命令行中包含-Mre=debug
或在代码中包含use re "debug";
。这就是产生的结果:
% perl -Mre=debug /tmp/trie
Compiling REx "\s*#\s*(.*)$"
synthetic stclass "ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}]".
Final program:
1: STAR (3)
2: SPACE (0)
3: EXACT <#> (5)
5: STAR (7)
6: SPACE (0)
7: OPEN1 (9)
9: STAR (11)
10: REG_ANY (0)
11: CLOSE1 (13)
13: EOL (14)
14: END (0)
floating "#" at 0..2147483647 (checking floating) stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] minlen 1
Compiling REx "four|five"
Final program:
1: EXACT <f> (3)
3: TRIE-EXACT[io] (7)
<our>
<ive>
7: END (0)
anchored "f" at 0 (checking anchored) minlen 4
Compiling REx "one|two|three"
Final program:
1: TRIEC-EXACT[ot] (11)
<one>
<two>
<three>
11: END (0)
stclass AHOCORASICKC-EXACT[ot] minlen 3
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "three one two # is a matching text (contains all whitelist w"...
Found floating substr "#" at offset 14...
start_shift: 0 check_at: 14 s: 0 endpos: 15
By STCLASS: moving 0 --> 5
Guessed: match at offset 5
Matching REx "\s*#\s*(.*)$" against " one two # is a matching text (contains all whitelist words)"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " one two # is a matching text (contains all whitelist words)"... (61 bytes)
5 <three> < one two #> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
9 <e one> < two # is > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
13 <e two> < # is a ma> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
14 < two > <# is a mat> | 3: EXACT <#>(5)
15 <two #> < is a matc> | 5: STAR(7)
SPACE can match 1 times out of 2147483647...
16 <wo # > <is a match> | 7: OPEN1(9)
16 <wo # > <is a match> | 9: STAR(11)
REG_ANY can match 49 times out of 2147483647...
65 <list words)> <%n> | 11: CLOSE1(13)
65 <list words)> <%n> | 13: EOL(14)
65 <list words)> <%n> | 14: END(0)
Match successful!
This is a matching text (contains all whitelist words)
Guessing start of match in sv for REx "four|five" against "three one two%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Compiling REx "\b(?^:one|two|three)\b"
Final program:
1: BOUND (2)
2: TRIEC-EXACT[ot] (13)
<one>
<two>
<three>
13: BOUND (14)
14: END (0)
stclass BOUND minlen 3
Matching REx "\b(?^:one|two|three)\b" against "three one two%n"
Matching stclass BOUND against "three one tw" (12 bytes)
0 <> <three one > | 1:BOUND(2)
0 <> <three one > | 2:TRIEC-EXACT[ot](13)
0 <> <three one > | State: 1 Accepted: N Charid: 4 CP: 74 After State: 5
1 <t> <hree one t> | State: 5 Accepted: N Charid: 6 CP: 68 After State: 8
2 <th> <ree one tw> | State: 8 Accepted: N Charid: 7 CP: 72 After State: 9
3 <thr> <ee one two> | State: 9 Accepted: N Charid: 3 CP: 65 After State: a
4 <thre> <e one two> | State: a Accepted: N Charid: 3 CP: 65 After State: b
5 <three> < one two%n> | State: b Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #3, continuing
only one match left, short-circuiting: #3 <three>
5 <three> < one two%n> | 13:BOUND(14)
5 <three> < one two%n> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " one two%n"
Matching stclass BOUND against " one tw" (7 bytes)
5 <three> < one two%n> | 1:BOUND(2)
5 <three> < one two%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
6 <hree > <one two%n> | 1:BOUND(2)
6 <hree > <one two%n> | 2:TRIEC-EXACT[ot](13)
6 <hree > <one two%n> | State: 1 Accepted: N Charid: 1 CP: 6f After State: 2
7 <ree o> <ne two%n> | State: 2 Accepted: N Charid: 2 CP: 6e After State: 3
8 <ree on> <e two%n> | State: 3 Accepted: N Charid: 3 CP: 65 After State: 4
9 <ree one> < two%n> | State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #1, continuing
only one match left, short-circuiting: #1 <one>
9 <ree one> < two%n> | 13:BOUND(14)
9 <ree one> < two%n> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " two%n"
Matching stclass BOUND against " tw" (3 bytes)
9 <ree one> < two%n> | 1:BOUND(2)
9 <ree one> < two%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
10 <ree one > <two%n> | 1:BOUND(2)
10 <ree one > <two%n> | 2:TRIEC-EXACT[ot](13)
10 <ree one > <two%n> | State: 1 Accepted: N Charid: 4 CP: 74 After State: 5
11 <ree one t> <wo%n> | State: 5 Accepted: N Charid: 5 CP: 77 After State: 6
12 <ree one tw> <o%n> | State: 6 Accepted: N Charid: 1 CP: 6f After State: 7
13 <ree one two> <%n> | State: 7 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #2, continuing
only one match left, short-circuiting: #2 <two>
13 <ree one two> <%n> | 13:BOUND(14)
13 <ree one two> <%n> | 14:END(0)
Match successful!
GOOD: three one two
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "one three two six # is a matching text (contains all whiteli"...
Found floating substr "#" at offset 18...
start_shift: 0 check_at: 18 s: 0 endpos: 19
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " three two six # is a matching text (contains all whitelist "...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " three two six # is a matching text (contains all whitelist "... (67 bytes)
3 <one> < three two> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
9 <three> < two six #> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
13 <e two> < six # is > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
17 <o six> < # is a ma> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
18 < six > <# is a mat> | 3: EXACT <#>(5)
19 <six #> < is a matc> | 5: STAR(7)
SPACE can match 1 times out of 2147483647...
20 <ix # > <is a match> | 7: OPEN1(9)
20 <ix # > <is a match> | 9: STAR(11)
REG_ANY can match 49 times out of 2147483647...
69 <list words)> <%n> | 11: CLOSE1(13)
69 <list words)> <%n> | 13: EOL(14)
69 <list words)> <%n> | 14: END(0)
Match successful!
This is a matching text (contains all whitelist words)
Guessing start of match in sv for REx "four|five" against "one three two six%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Matching REx "\b(?^:one|two|three)\b" against "one three two six%n"
Matching stclass BOUND against "one three two si" (16 bytes)
0 <> <one three > | 1:BOUND(2)
0 <> <one three > | 2:TRIEC-EXACT[ot](13)
0 <> <one three > | State: 1 Accepted: N Charid: 1 CP: 6f After State: 2
1 <o> <ne three t> | State: 2 Accepted: N Charid: 2 CP: 6e After State: 3
2 <on> <e three tw> | State: 3 Accepted: N Charid: 3 CP: 65 After State: 4
3 <one> < three two> | State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #1, continuing
only one match left, short-circuiting: #1 <one>
3 <one> < three two> | 13:BOUND(14)
3 <one> < three two> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " three two six%n"
Matching stclass BOUND against " three two si" (13 bytes)
3 <one> < three two> | 1:BOUND(2)
3 <one> < three two> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
4 <one > <three two > | 1:BOUND(2)
4 <one > <three two > | 2:TRIEC-EXACT[ot](13)
4 <one > <three two > | State: 1 Accepted: N Charid: 4 CP: 74 After State: 5
5 <one t> <hree two s> | State: 5 Accepted: N Charid: 6 CP: 68 After State: 8
6 <ne th> <ree two si> | State: 8 Accepted: N Charid: 7 CP: 72 After State: 9
7 <e thr> <ee two six> | State: 9 Accepted: N Charid: 3 CP: 65 After State: a
8 < thre> <e two six> | State: a Accepted: N Charid: 3 CP: 65 After State: b
9 <three> < two six%n> | State: b Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #3, continuing
only one match left, short-circuiting: #3 <three>
9 <three> < two six%n> | 13:BOUND(14)
9 <three> < two six%n> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " two six%n"
Matching stclass BOUND against " two si" (7 bytes)
9 <three> < two six%n> | 1:BOUND(2)
9 <three> < two six%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
10 <hree > <two six%n> | 1:BOUND(2)
10 <hree > <two six%n> | 2:TRIEC-EXACT[ot](13)
10 <hree > <two six%n> | State: 1 Accepted: N Charid: 4 CP: 74 After State: 5
11 <ree t> <wo six%n> | State: 5 Accepted: N Charid: 5 CP: 77 After State: 6
12 <ree tw> <o six%n> | State: 6 Accepted: N Charid: 1 CP: 6f After State: 7
13 <ree two> < six%n> | State: 7 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #2, continuing
only one match left, short-circuiting: #2 <two>
13 <ree two> < six%n> | 13:BOUND(14)
13 <ree two> < six%n> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " six%n"
Matching stclass BOUND against " si" (3 bytes)
13 <ree two> < six%n> | 1:BOUND(2)
13 <ree two> < six%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
14 <ree two > <six%n> | 1:BOUND(2)
14 <ree two > <six%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
Contradicts stclass... [regexec_flags]
Match failed
GOOD: one three two six
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "two one # is not a matching text (lacks a whitelist word thr"...
Found floating substr "#" at offset 8...
start_shift: 0 check_at: 8 s: 0 endpos: 9
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " one # is not a matching text (lacks a whitelist word three)"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " one # is not a matching text (lacks a whitelist word three)"... (61 bytes)
3 <two> < one # is > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
7 <o one> < # is not > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
8 < one > <# is not a> | 3: EXACT <#>(5)
9 <one #> < is not a > | 5: STAR(7)
SPACE can match 1 times out of 2147483647...
10 <ne # > <is not a m> | 7: OPEN1(9)
10 <ne # > <is not a m> | 9: STAR(11)
REG_ANY can match 53 times out of 2147483647...
63 <word three)> <%n> | 11: CLOSE1(13)
63 <word three)> <%n> | 13: EOL(14)
63 <word three)> <%n> | 14: END(0)
Match successful!
This is not a matching text (lacks a whitelist word three)
Guessing start of match in sv for REx "four|five" against "two one%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Matching REx "\b(?^:one|two|three)\b" against "two one%n"
Matching stclass BOUND against "two on" (6 bytes)
0 <> <two one%n> | 1:BOUND(2)
0 <> <two one%n> | 2:TRIEC-EXACT[ot](13)
0 <> <two one%n> | State: 1 Accepted: N Charid: 4 CP: 74 After State: 5
1 <t> <wo one%n> | State: 5 Accepted: N Charid: 5 CP: 77 After State: 6
2 <tw> <o one%n> | State: 6 Accepted: N Charid: 1 CP: 6f After State: 7
3 <two> < one%n> | State: 7 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #2, continuing
only one match left, short-circuiting: #2 <two>
3 <two> < one%n> | 13:BOUND(14)
3 <two> < one%n> | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " one%n"
Matching stclass BOUND against " on" (3 bytes)
3 <two> < one%n> | 1:BOUND(2)
3 <two> < one%n> | 2:TRIEC-EXACT[ot](13)
failed to match trie start class...
4 <two > <one%n> | 1:BOUND(2)
4 <two > <one%n> | 2:TRIEC-EXACT[ot](13)
4 <two > <one%n> | State: 1 Accepted: N Charid: 1 CP: 6f After State: 2
5 <two o> <ne%n> | State: 2 Accepted: N Charid: 2 CP: 6e After State: 3
6 <two on> <e%n> | State: 3 Accepted: N Charid: 3 CP: 65 After State: 4
7 <two one> <%n> | State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #1, continuing
only one match left, short-circuiting: #1 <one>
7 <two one> <%n> | 13:BOUND(14)
7 <two one> <%n> | 14:END(0)
Match successful!
EVIL: two one
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "one four two three # is not a matching text (contains a blac"...
Found floating substr "#" at offset 19...
start_shift: 0 check_at: 19 s: 0 endpos: 20
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " four two three # is not a matching text (contains a blackli"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " four two three # is not a matching text (contains a blackli"... (74 bytes)
3 <one> < four two > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
8 < four> < two three> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
12 <r two> < three # i> | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
failed...
18 <three> < # is not > | 1:STAR(3)
SPACE can match 1 times out of 2147483647...
19 <hree > <# is not a> | 3: EXACT <#>(5)
20 <ree #> < is not a > | 5: STAR(7)
SPACE can match 1 times out of 2147483647...
21 <ee # > <is not a m> | 7: OPEN1(9)
21 <ee # > <is not a m> | 9: STAR(11)
REG_ANY can match 55 times out of 2147483647...
76 < word four)> <%n> | 11: CLOSE1(13)
76 < word four)> <%n> | 13: EOL(14)
76 < word four)> <%n> | 14: END(0)
Match successful!
This is not a matching text (contains a blacklist word four)
Guessing start of match in sv for REx "four|five" against "one four two three%n"
Found anchored substr "f" at offset 4...
Starting position does not contradict /^/m...
Guessed: match at offset 4
Matching REx "four|five" against "four two three%n"
4 <one > <four two t> | 1:EXACT <f>(3)
5 <one f> <our two th> | 3:TRIE-EXACT[io](7)
5 <one f> <our two th> | State: 2 Accepted: N Charid: 2 CP: 6f After State: 3
6 <ne fo> <ur two thr> | State: 3 Accepted: N Charid: 3 CP: 75 After State: 4
7 <e fou> <r two thre> | State: 4 Accepted: N Charid: 4 CP: 72 After State: 5
8 < four> < two three> | State: 5 Accepted: Y Charid: 0 CP: 0 After State: 0
got 1 possible matches
TRIE matched word #1, continuing
only one match left, short-circuiting: #1 <our>
8 < four> < two three> | 7:END(0)
Match successful!
EVIL: one four two three
Freeing REx: "one|two|three"
Freeing REx: "\s*#\s*(.*)$"
我知道没有其他语言允许你像这样调试你的正则表达式编译和执行,这本身就值得我书上的录取价格。