Match text only if it contains all whitelist words and no blacklist words

Time: 2011-11-04 15:38:25

Tags: regex

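What I want to achieve is easiest to explain with an example.

Suppose the whitelist is one two three and the blacklist is four five. Then:

  • three one two is a matching text (contains all whitelist words);
  • one three two six is a matching text (contains all whitelist words);
  • two one is not a matching text (lacks the whitelist word three);
  • one four two three is not a matching text (contains the blacklist word four).

Can anyone help me with a regular expression for this case?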

2 answers:

Answer 0 (score: 9)

This is not something you would want to use a regex for. It is better to do it like this (example in Python):

>>> whitelist = ["one", "two", "three"]
>>> blacklist = ["four", "five"]
>>> texts = ["three two one", "one three two six", "one two", "one two three four"]
>>> for text in texts:
...     mytext = text.split()
...     if all(word in mytext for word in whitelist) and \
...        not any(word in mytext for word in blacklist):
...         print(text)
...
three two one
one three two six
>>>
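
The same check can also be written with set operations. The following is a minimal sketch, assuming whitespace tokenization as in the loop above; the helper name `matches` is made up for illustration and is not part of the original answer.

```python
def matches(text, whitelist, blacklist):
    """Return True if text has every whitelist word and no blacklist word."""
    words = set(text.split())  # whitespace tokenization, as in the loop above
    # Subset test covers "all whitelist words"; isdisjoint covers "no blacklist word".
    return set(whitelist) <= words and words.isdisjoint(blacklist)


whitelist = ["one", "two", "three"]
blacklist = ["four", "five"]
texts = ["three two one", "one three two six", "one two", "one two three four"]

for text in texts:
    if matches(text, whitelist, blacklist):
        print(text)
```

Because the text is converted to a set once, each word lookup is a hash lookup rather than a scan of the token list, which may matter for long texts or long word lists.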

It can be done with a regex, though:

^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)(?!.*\bfour\b)(?!.*\bfive\b)
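  • ^ anchors the search at the start of the string.
  • (?=...) asserts that its contents can be matched from the current position, without consuming any characters.
  • (?!...) asserts that its contents cannot be matched from the current position.
  • \bone\b matches one but not lonely.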

So you get:

>>> import re
>>> r = re.compile(r"^(?=.*\bone\b)(?=.*\btwo\b)(?=.*\bthree\b)(?!.*\bfour\b)(?!.*\bfive\b)")
>>> for text in texts:
...     if r.match(text):
...         print(text)
...
three two one
one three two six
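
The pattern above is written out by hand. If the word lists change, it could be generated instead. This is a sketch under that assumption; `build_pattern` is a hypothetical helper name, and `re.escape` is added so that words containing regex metacharacters are treated literally.

```python
import re


def build_pattern(whitelist, blacklist):
    """Build one positive lookahead per whitelist word and one negative per blacklist word."""
    required = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in whitelist)
    forbidden = "".join(rf"(?!.*\b{re.escape(w)}\b)" for w in blacklist)
    return re.compile("^" + required + forbidden)


pattern = build_pattern(["one", "two", "three"], ["four", "five"])

for text in ["three two one", "one three two six", "one two", "one two three four"]:
    if pattern.match(text):
        print(text)
```

Note that `.` does not match a newline by default, so for multi-line input the pattern would need the `re.DOTALL` flag to look past the first line.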

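Answer 1 (score: 3)

EDIT

My solution to this problem takes too much trouble for a single regex, so I don't think it is a good one (or, after looking at it more closely, even a correct one). Counting and set removal are not things you want to build into a single regex. Use auxiliary logic instead. See Tim's solution.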


Using Perl, the tool of choice :) for this kind of work:

@blacklist = qw(four five);
$blacklist = do { local $" = "|"; qr/@blacklist/ };

@whitelist = qw(one two three);
$whitelist = do { local $" = "|"; qr/@whitelist/ };

if ($string =~ /\b$whitelist\b/ && $string !~ /\b$blacklist\b/) { ... } 

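Perl compiles these alternations into a trie data structure, so they execute very quickly no matter how many alternatives you have. In many practical applications this optimization, which makes the matching cost effectively O(1) in the number of alternatives, matters a great deal.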
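
To illustrate why a trie makes the cost independent of the number of alternatives, here is a toy sketch in Python. It is not Perl's actual implementation; `build_trie` and `match_at` are hypothetical names, and the nested-dict layout is just one simple way to represent a trie.

```python
def build_trie(words):
    """Nested-dict trie; the key None marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def match_at(trie, text, pos):
    """Return the longest trie word starting at text[pos], or None if there is none."""
    node, found = trie, None
    i = pos
    # Each step follows one character, so the work is bounded by the
    # length of the longest word, not by how many words are in the trie.
    while i < len(text) and text[i] in node:
        node = node[text[i]]
        found = node.get(None, found)
        i += 1
    return found


trie = build_trie(["one", "two", "three"])
print(match_at(trie, "three one", 0))
print(match_at(trie, "three one", 6))
```

A naive alternation would try each alternative in turn at every position; the trie instead shares common prefixes (here the `t` of two and three) and walks the input once.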


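EDIT

I misread the question as wanting some of the whitelist and none of the blacklist, rather than all of the whitelist and none of the blacklist. That is not what I would call a whitelist. Perhaps a want-list and a don't-want-list.

Anyway, here is an update for the all-and-none case, although for better performance you could of course swap the order and make it none-and-all. (This assumes the words don't overlap; I could handle the overlapping case, but it is easier to understand without it.)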

#!/usr/bin/env perl   
use strict;
use warnings;

my @blacklist = qw(four five);
my $blacklist = do { local $" = "|"; qr/@blacklist/ };

my @whitelist = qw(one two three);
my $whitelist = do { local $" = "|"; qr/@whitelist/ };

while (<DATA>) {
    s/\s*#\s*(.*)$// && print "This $1\n\t";
    if (/$blacklist/ == 0 && @whitelist == (() = /\b$whitelist\b/g)) { 
        print "GOOD: ";
    } else {
        print "EVIL: ";
    } 
    print;
}     
__END__
three one two # is a matching text (contains all whitelist words)
one three two six # is a matching text (contains all whitelist words)
two one # is not a matching text (lacks a whitelist word three)
one four two three # is not a matching text (contains a blacklist word four)

When run, this reports:

This is a matching text (contains all whitelist words)
        GOOD: three one two
This is a matching text (contains all whitelist words)
        GOOD: one three two six
This is not a matching text (lacks a whitelist word three)
        EVIL: two one
This is not a matching text (contains a blacklist word four)
        EVIL: one four two three
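
The script above compares the number of whitelist matches with the size of the whitelist, so it relies on each whitelist word occurring at most once: one one two would yield three matches and be reported as GOOD. A sketch of the auxiliary-logic alternative in Python, comparing the set of distinct matched words instead, might look like the following. The names `white_re`, `black_re`, and `is_good` are made up for illustration, and unlike the Perl script this version also puts \b around the blacklist alternation.

```python
import re

whitelist = ["one", "two", "three"]
blacklist = ["four", "five"]

# One alternation per list, with word boundaries on both sides.
white_re = re.compile(r"\b(?:" + "|".join(map(re.escape, whitelist)) + r")\b")
black_re = re.compile(r"\b(?:" + "|".join(map(re.escape, blacklist)) + r")\b")


def is_good(text):
    """True if no blacklist word occurs and every whitelist word occurs at least once."""
    if black_re.search(text):  # none-and-all: reject on the blacklist first
        return False
    # Compare distinct matched words, so repeated words cannot inflate the count.
    return set(white_re.findall(text)) == set(whitelist)


for text in ["three one two", "one three two six", "two one",
             "one four two three", "one one two", "one one two three"]:
    print("GOOD:" if is_good(text) else "EVIL:", text)
```

With this check, one one two is rejected because three is missing, and one one two three is accepted even though it contains four whitelist matches.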

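To debug the regexes, including seeing how the two trie data structures are compiled and executed, just add -Mre=debug on the command line or use re "debug"; in the code. This is what it produces: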

% perl -Mre=debug /tmp/trie
Compiling REx "\s*#\s*(.*)$"
synthetic stclass "ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}]".
Final program:
   1: STAR (3)
   2:   SPACE (0)
   3: EXACT <#> (5)
   5: STAR (7)
   6:   SPACE (0)
   7: OPEN1 (9)
   9:   STAR (11)
  10:     REG_ANY (0)
  11: CLOSE1 (13)
  13: EOL (14)
  14: END (0)
floating "#" at 0..2147483647 (checking floating) stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] minlen 1 
Compiling REx "four|five"
Final program:
   1: EXACT <f> (3)
   3: TRIE-EXACT[io] (7)
      <our> 
      <ive> 
   7: END (0)
anchored "f" at 0 (checking anchored) minlen 4 
Compiling REx "one|two|three"
Final program:
   1: TRIEC-EXACT[ot] (11)
      <one> 
      <two> 
      <three> 
  11: END (0)
stclass AHOCORASICKC-EXACT[ot] minlen 3 
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "three one two # is a matching text (contains all whitelist w"...
Found floating substr "#" at offset 14...
start_shift: 0 check_at: 14 s: 0 endpos: 15
By STCLASS: moving 0 --> 5
Guessed: match at offset 5
Matching REx "\s*#\s*(.*)$" against " one two # is a matching text (contains all whitelist words)"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " one two # is a matching text (contains all whitelist words)"... (61 bytes)
   5 <three> < one two #>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   9 <e one> < two # is >    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
  13 <e two> < # is a ma>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
  14 < two > <# is a mat>    |  3:  EXACT <#>(5)
  15 <two #> < is a matc>    |  5:  STAR(7)
                                    SPACE can match 1 times out of 2147483647...
  16 <wo # > <is a match>    |  7:    OPEN1(9)
  16 <wo # > <is a match>    |  9:    STAR(11)
                                      REG_ANY can match 49 times out of 2147483647...
  65 <list words)> <%n>      | 11:      CLOSE1(13)
  65 <list words)> <%n>      | 13:      EOL(14)
  65 <list words)> <%n>      | 14:      END(0)
Match successful!
This is a matching text (contains all whitelist words)
Guessing start of match in sv for REx "four|five" against "three one two%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Compiling REx "\b(?^:one|two|three)\b"
Final program:
   1: BOUND (2)
   2: TRIEC-EXACT[ot] (13)
      <one> 
      <two> 
      <three> 
  13: BOUND (14)
  14: END (0)
stclass BOUND minlen 3 
Matching REx "\b(?^:one|two|three)\b" against "three one two%n"
Matching stclass BOUND against "three one tw" (12 bytes)
   0 <> <three one >         |  1:BOUND(2)
   0 <> <three one >         |  2:TRIEC-EXACT[ot](13)
   0 <> <three one >         |    State:    1 Accepted: N Charid:  4 CP:  74 After State:    5
   1 <t> <hree one t>        |    State:    5 Accepted: N Charid:  6 CP:  68 After State:    8
   2 <th> <ree one tw>       |    State:    8 Accepted: N Charid:  7 CP:  72 After State:    9
   3 <thr> <ee one two>      |    State:    9 Accepted: N Charid:  3 CP:  65 After State:    a
   4 <thre> <e one two>      |    State:    a Accepted: N Charid:  3 CP:  65 After State:    b
   5 <three> < one two%n>    |    State:    b Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #3, continuing
                                  only one match left, short-circuiting: #3 <three>
   5 <three> < one two%n>    | 13:BOUND(14)
   5 <three> < one two%n>    | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " one two%n"
Matching stclass BOUND against " one tw" (7 bytes)
   5 <three> < one two%n>    |  1:BOUND(2)
   5 <three> < one two%n>    |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
   6 <hree > <one two%n>     |  1:BOUND(2)
   6 <hree > <one two%n>     |  2:TRIEC-EXACT[ot](13)
   6 <hree > <one two%n>     |    State:    1 Accepted: N Charid:  1 CP:  6f After State:    2
   7 <ree o> <ne two%n>      |    State:    2 Accepted: N Charid:  2 CP:  6e After State:    3
   8 <ree on> <e two%n>      |    State:    3 Accepted: N Charid:  3 CP:  65 After State:    4
   9 <ree one> < two%n>      |    State:    4 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #1, continuing
                                  only one match left, short-circuiting: #1 <one>
   9 <ree one> < two%n>      | 13:BOUND(14)
   9 <ree one> < two%n>      | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " two%n"
Matching stclass BOUND against " tw" (3 bytes)
   9 <ree one> < two%n>      |  1:BOUND(2)
   9 <ree one> < two%n>      |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
  10 <ree one > <two%n>      |  1:BOUND(2)
  10 <ree one > <two%n>      |  2:TRIEC-EXACT[ot](13)
  10 <ree one > <two%n>      |    State:    1 Accepted: N Charid:  4 CP:  74 After State:    5
  11 <ree one t> <wo%n>      |    State:    5 Accepted: N Charid:  5 CP:  77 After State:    6
  12 <ree one tw> <o%n>      |    State:    6 Accepted: N Charid:  1 CP:  6f After State:    7
  13 <ree one two> <%n>      |    State:    7 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #2, continuing
                                  only one match left, short-circuiting: #2 <two>
  13 <ree one two> <%n>      | 13:BOUND(14)
  13 <ree one two> <%n>      | 14:END(0)
Match successful!
    GOOD: three one two
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "one three two six # is a matching text (contains all whiteli"...
Found floating substr "#" at offset 18...
start_shift: 0 check_at: 18 s: 0 endpos: 19
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " three two six # is a matching text (contains all whitelist "...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " three two six # is a matching text (contains all whitelist "... (67 bytes)
   3 <one> < three two>      |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   9 <three> < two six #>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
  13 <e two> < six # is >    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
  17 <o six> < # is a ma>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
  18 < six > <# is a mat>    |  3:  EXACT <#>(5)
  19 <six #> < is a matc>    |  5:  STAR(7)
                                    SPACE can match 1 times out of 2147483647...
  20 <ix # > <is a match>    |  7:    OPEN1(9)
  20 <ix # > <is a match>    |  9:    STAR(11)
                                      REG_ANY can match 49 times out of 2147483647...
  69 <list words)> <%n>      | 11:      CLOSE1(13)
  69 <list words)> <%n>      | 13:      EOL(14)
  69 <list words)> <%n>      | 14:      END(0)
Match successful!
This is a matching text (contains all whitelist words)
Guessing start of match in sv for REx "four|five" against "one three two six%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Matching REx "\b(?^:one|two|three)\b" against "one three two six%n"
Matching stclass BOUND against "one three two si" (16 bytes)
   0 <> <one three >         |  1:BOUND(2)
   0 <> <one three >         |  2:TRIEC-EXACT[ot](13)
   0 <> <one three >         |    State:    1 Accepted: N Charid:  1 CP:  6f After State:    2
   1 <o> <ne three t>        |    State:    2 Accepted: N Charid:  2 CP:  6e After State:    3
   2 <on> <e three tw>       |    State:    3 Accepted: N Charid:  3 CP:  65 After State:    4
   3 <one> < three two>      |    State:    4 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #1, continuing
                                  only one match left, short-circuiting: #1 <one>
   3 <one> < three two>      | 13:BOUND(14)
   3 <one> < three two>      | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " three two six%n"
Matching stclass BOUND against " three two si" (13 bytes)
   3 <one> < three two>      |  1:BOUND(2)
   3 <one> < three two>      |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
   4 <one > <three two >     |  1:BOUND(2)
   4 <one > <three two >     |  2:TRIEC-EXACT[ot](13)
   4 <one > <three two >     |    State:    1 Accepted: N Charid:  4 CP:  74 After State:    5
   5 <one t> <hree two s>    |    State:    5 Accepted: N Charid:  6 CP:  68 After State:    8
   6 <ne th> <ree two si>    |    State:    8 Accepted: N Charid:  7 CP:  72 After State:    9
   7 <e thr> <ee two six>    |    State:    9 Accepted: N Charid:  3 CP:  65 After State:    a
   8 < thre> <e two six>     |    State:    a Accepted: N Charid:  3 CP:  65 After State:    b
   9 <three> < two six%n>    |    State:    b Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #3, continuing
                                  only one match left, short-circuiting: #3 <three>
   9 <three> < two six%n>    | 13:BOUND(14)
   9 <three> < two six%n>    | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " two six%n"
Matching stclass BOUND against " two si" (7 bytes)
   9 <three> < two six%n>    |  1:BOUND(2)
   9 <three> < two six%n>    |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
  10 <hree > <two six%n>     |  1:BOUND(2)
  10 <hree > <two six%n>     |  2:TRIEC-EXACT[ot](13)
  10 <hree > <two six%n>     |    State:    1 Accepted: N Charid:  4 CP:  74 After State:    5
  11 <ree t> <wo six%n>      |    State:    5 Accepted: N Charid:  5 CP:  77 After State:    6
  12 <ree tw> <o six%n>      |    State:    6 Accepted: N Charid:  1 CP:  6f After State:    7
  13 <ree two> < six%n>      |    State:    7 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #2, continuing
                                  only one match left, short-circuiting: #2 <two>
  13 <ree two> < six%n>      | 13:BOUND(14)
  13 <ree two> < six%n>      | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " six%n"
Matching stclass BOUND against " si" (3 bytes)
  13 <ree two> < six%n>      |  1:BOUND(2)
  13 <ree two> < six%n>      |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
  14 <ree two > <six%n>      |  1:BOUND(2)
  14 <ree two > <six%n>      |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
Contradicts stclass... [regexec_flags]
Match failed
    GOOD: one three two six
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "two one # is not a matching text (lacks a whitelist word thr"...
Found floating substr "#" at offset 8...
start_shift: 0 check_at: 8 s: 0 endpos: 9
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " one # is not a matching text (lacks a whitelist word three)"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " one # is not a matching text (lacks a whitelist word three)"... (61 bytes)
   3 <two> < one # is >      |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   7 <o one> < # is not >    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
   8 < one > <# is not a>    |  3:  EXACT <#>(5)
   9 <one #> < is not a >    |  5:  STAR(7)
                                    SPACE can match 1 times out of 2147483647...
  10 <ne # > <is not a m>    |  7:    OPEN1(9)
  10 <ne # > <is not a m>    |  9:    STAR(11)
                                      REG_ANY can match 53 times out of 2147483647...
  63 <word three)> <%n>      | 11:      CLOSE1(13)
  63 <word three)> <%n>      | 13:      EOL(14)
  63 <word three)> <%n>      | 14:      END(0)
Match successful!
This is not a matching text (lacks a whitelist word three)
Guessing start of match in sv for REx "four|five" against "two one%n"
Did not find anchored substr "f"...
Match rejected by optimizer
Matching REx "\b(?^:one|two|three)\b" against "two one%n"
Matching stclass BOUND against "two on" (6 bytes)
   0 <> <two one%n>          |  1:BOUND(2)
   0 <> <two one%n>          |  2:TRIEC-EXACT[ot](13)
   0 <> <two one%n>          |    State:    1 Accepted: N Charid:  4 CP:  74 After State:    5
   1 <t> <wo one%n>          |    State:    5 Accepted: N Charid:  5 CP:  77 After State:    6
   2 <tw> <o one%n>          |    State:    6 Accepted: N Charid:  1 CP:  6f After State:    7
   3 <two> < one%n>          |    State:    7 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #2, continuing
                                  only one match left, short-circuiting: #2 <two>
   3 <two> < one%n>          | 13:BOUND(14)
   3 <two> < one%n>          | 14:END(0)
Match successful!
Matching REx "\b(?^:one|two|three)\b" against " one%n"
Matching stclass BOUND against " on" (3 bytes)
   3 <two> < one%n>          |  1:BOUND(2)
   3 <two> < one%n>          |  2:TRIEC-EXACT[ot](13)
                                  failed to match trie start class...
   4 <two > <one%n>          |  1:BOUND(2)
   4 <two > <one%n>          |  2:TRIEC-EXACT[ot](13)
   4 <two > <one%n>          |    State:    1 Accepted: N Charid:  1 CP:  6f After State:    2
   5 <two o> <ne%n>          |    State:    2 Accepted: N Charid:  2 CP:  6e After State:    3
   6 <two on> <e%n>          |    State:    3 Accepted: N Charid:  3 CP:  65 After State:    4
   7 <two one> <%n>          |    State:    4 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #1, continuing
                                  only one match left, short-circuiting: #1 <one>
   7 <two one> <%n>          | 13:BOUND(14)
   7 <two one> <%n>          | 14:END(0)
Match successful!
    EVIL: two one
Guessing start of match in sv for REx "\s*#\s*(.*)$" against "one four two three # is not a matching text (contains a blac"...
Found floating substr "#" at offset 19...
start_shift: 0 check_at: 19 s: 0 endpos: 20
By STCLASS: moving 0 --> 3
Guessed: match at offset 3
Matching REx "\s*#\s*(.*)$" against " four two three # is not a matching text (contains a blackli"...
Matching stclass ANYOF{i}[\x09\x0a\x0c\x0d #][{non-utf8-latin1-all}{unicode_all}] against " four two three # is not a matching text (contains a blackli"... (74 bytes)
   3 <one> < four two >      |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
   8 < four> < two three>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
  12 <r two> < three # i>    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
                                  failed...
  18 <three> < # is not >    |  1:STAR(3)
                                  SPACE can match 1 times out of 2147483647...
  19 <hree > <# is not a>    |  3:  EXACT <#>(5)
  20 <ree #> < is not a >    |  5:  STAR(7)
                                    SPACE can match 1 times out of 2147483647...
  21 <ee # > <is not a m>    |  7:    OPEN1(9)
  21 <ee # > <is not a m>    |  9:    STAR(11)
                                      REG_ANY can match 55 times out of 2147483647...
  76 < word four)> <%n>      | 11:      CLOSE1(13)
  76 < word four)> <%n>      | 13:      EOL(14)
  76 < word four)> <%n>      | 14:      END(0)
Match successful!
This is not a matching text (contains a blacklist word four)
Guessing start of match in sv for REx "four|five" against "one four two three%n"
Found anchored substr "f" at offset 4...
Starting position does not contradict /^/m...
Guessed: match at offset 4
Matching REx "four|five" against "four two three%n"
   4 <one > <four two t>     |  1:EXACT <f>(3)
   5 <one f> <our two th>    |  3:TRIE-EXACT[io](7)
   5 <one f> <our two th>    |    State:    2 Accepted: N Charid:  2 CP:  6f After State:    3
   6 <ne fo> <ur two thr>    |    State:    3 Accepted: N Charid:  3 CP:  75 After State:    4
   7 <e fou> <r two thre>    |    State:    4 Accepted: N Charid:  4 CP:  72 After State:    5
   8 < four> < two three>    |    State:    5 Accepted: Y Charid:  0 CP:   0 After State:    0
                                  got 1 possible matches
                                  TRIE matched word #1, continuing
                                  only one match left, short-circuiting: #1 <our>
   8 < four> < two three>    |  7:END(0)
Match successful!
    EVIL: one four two three
Freeing REx: "one|two|three"
Freeing REx: "\s*#\s*(.*)$"

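I know of no other language that lets you debug your regex compilation and execution like this, and that alone is worth the price of admission in my book.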