Python regex: matching either "\*" (a literal asterisk character) or "\s" (whitespace)

Date: 2014-02-04 04:59:59

Tags: python regex

I am trying to match the "D" lines, and capture characters 2, 3, 4 and 5, in a data set similar to:

S    7....                        <- line 1
         associated random data   <- line 2
D*EX 0....                        <- line 3
         associated random data   <- line 4
C    0....                        <- line 5
         associated random data   <- line 6
D E  6....                        <- line 7
         associated random data   <- line 8
         associated random data   <- line 9
D    3....                        <- line 10
         associated random data   <- line 11
D O  3....                        <- line 12
         associated random data   <- line 13
         associated random data   <- line 14

That is, I don't want to just capture ^D.* because the "EX" characters can vary and I need to tell them apart later.

The problem I'm having seems to be the choice between a "*" and a " " (space) in the second character (column).

Specifying a choice between "*" and "\s" does not seem to match the line "D*EX 0....":

re.compile(r'''^(^[D]               # Match "D"
                [\*|\s]         <-- # Match either "*" or " "
                [A-Z{1,2}\s|\s{3}]  # match either "EX" + "" OR match 3x" "
.*?)^[A-Z]''', re.DOTALL | re.MULTILINE |re.VERBOSE)  # match anything else if there...

Matches and outputs => D EX 6....D 3....

If I specify "*" alone, I do end up with one line matching, but of course the other lines don't match.

re.compile(r'''^(^[D]               # Match "D"
                [\*]            <-- # Match ONLY "*"
                [A-Z{1,2}\s|\s{3}]  # match either "EX" + "" OR match 3x" "
.*?)^[A-Z]''', re.DOTALL | re.MULTILINE |re.VERBOSE)  # match anything else if there...

Only matches and outputs => D*EX 0....

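It was suggested that I try a non-capturing group. NC groups are new to me, though they make some sense; I would probably still want the captured output. With an NC group around the original choice between "*" and "\s", it still does not match. I have played with many combinations, but the output stays consistent with the one below.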

re.compile(r'''^(^[D]               # Match "D"
                (?:[\*|\s]      <-- # non-capturing group match either "*" or " "
                [A-Z{1,2}\s|\s{3}]  # match either "EX" + "" OR match 3x" "
.*?)^[A-Z]''', re.DOTALL | re.MULTILINE |re.VERBOSE)  # match anything else if there...

Matches and outputs => D EX 0....D 0....

Any advice or suggestions appreciated; I'm going round in circles here :O

1 answer:

Answer 0: (score: 1)

Here is the setup:

import re

txt = '''S    7....                        <- line 1
         associated random data   <- line 2
D*EX 0....                        <- line 3
         associated random data   <- line 4
C    0....                        <- line 5
         associated random data   <- line 6
D E  6....                        <- line 7
         associated random data   <- line 8
         associated random data   <- line 9
D    3....                        <- line 10
         associated random data   <- line 11
D O  3....                        <- line 12
         associated random data   <- line 13
         associated random data   <- line 14'''

flags = re.DOTALL | re.MULTILINE |re.VERBOSE

Here is some sample usage:

re1 = re.compile(r'''^(D.*?)\d''', flags)  # raw string so \d reaches the regex engine intact
print(re.findall(re1, txt))

Which returns:

['D*EX ', 'D E  ', 'D    ', 'D O  ']
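Since the question asks for characters 2 to 5 so that they can be told apart later, here is a minimal sketch that captures the marker column and the three code columns separately. It assumes the fixed-width layout shown in the sample data; the names `sample`, `d_line`, `flag` and `code` are hypothetical and not from the original post. It does not use re.VERBOSE, so the literal space inside the brackets is significant.

```python
import re

# Shortened stand-in for the question's data (header lines only).
sample = (
    "S    7....\n"
    "D*EX 0....\n"
    "C    0....\n"
    "D E  6....\n"
    "D    3....\n"
    "D O  3...."
)

# Column 1 must be "D"; column 2 is "*" or a space; columns 3-5 are
# capital letters and/or spaces. No pipe is needed inside the brackets.
d_line = re.compile(r'^D(?P<flag>[* ])(?P<code>[A-Z ]{3})', re.MULTILINE)

pairs = d_line.findall(sample)
for flag, code in pairs:
    print(repr(flag), repr(code))
```

Each tuple in `pairs` holds the column-2 character and the three characters after it, so "D*EX" and "D E " can be distinguished downstream.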

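I realize you probably want all of the associated random data as well. If you want all of it, everything in the middle of the pattern is irrelevant; what matters most is the final part: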

Drop the multiline flag:

flags = re.DOTALL | re.VERBOSE

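Now start at the beginning of the string or of each new line, look for a D immediately following it, then capture non-greedily up to a bookend: either a newline followed by another capital letter, or the end of the string.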

re1 = re.compile(
  r'''(?:^|\n) # noncapturing, assert start of string or newline
      (D.*?)   # capture D and everything after it
      (?=\n[A-Z]|$) #lookahead, newline cap char or end of string?
  ''', flags)


for i in re.findall(re1, txt):
    print(i)

Which prints:

D*EX 0....                        <- line 3
         associated random data   <- line 4
D E  6....                        <- line 7
         associated random data   <- line 8
         associated random data   <- line 9
D    3....                        <- line 10
         associated random data   <- line 11
D O  3....                        <- line 12
         associated random data   <- line 13
         associated random data   <- line 14

I believe this is what you're looking for.
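For comparison, the same kind of block can also be collected with re.MULTILINE left on, by letting a lookahead stop at the next line that starts with a capital letter. This is only a sketch under the assumption that every header line begins with a capital letter in column 1 and every continuation line begins with a space; `sample`, `block` and `blocks` are hypothetical names.

```python
import re

# Shortened stand-in for the question's data.
sample = (
    "S    7....\n"
    "         data a\n"
    "D*EX 0....\n"
    "         data b\n"
    "C    0....\n"
    "D E  6....\n"
    "         data c\n"
    "         data d\n"
)

# ^D      : a line starting with D (MULTILINE makes ^ match at each line start)
# .*?     : as little as possible, newlines included (DOTALL)
# (?=...) : stop just before the next line starting with a capital, or at the end
block = re.compile(r'^D.*?(?=^[A-Z]|\Z)', re.DOTALL | re.MULTILINE)

blocks = block.findall(sample)
for b in blocks:
    print(b, end='')
```

Unlike the pattern above, each match here keeps its trailing newline, because the lookahead stops at the start of the next header line rather than before the newline.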

Postscript

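As a postscript, I went a long way down the rabbit hole with multiline before giving up. Maybe you can see from it what you were doing wrong.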

^((D[\*\s]([A-Z]\s{2}|[A-Z]{2}\s|\s{3}).*)$(?!^\n[A-Z]))

First, don't use pipes inside square brackets unless you want them matched as literal characters.
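A short sketch of that point, using the bracket expressions from the question. The variable names are hypothetical; the behaviour shown is that of Python's standard `re` module, where a character class matches exactly one character and treats `|`, `{` and `}` as ordinary members.

```python
import re

# Inside [...] the pipe is an ordinary character, not alternation.
bracket_with_pipe = re.compile(r'[\*|\s]')
print(bool(bracket_with_pipe.match('|')))           # the class also accepts "|"

# Quantifier-like text inside [...] is just more characters in the set,
# so this class matches a single character, never "EX " as a whole.
looks_like_quantifier = re.compile(r'[A-Z{1,2}\s|\s{3}]')
print(looks_like_quantifier.match('EX ').group())   # one character only

# Alternation belongs in a group outside the brackets.
star_or_space = re.compile(r'D(?:\*|\s)')           # same effect as D[*\s]
print(star_or_space.match('D*EX').group())
print(star_or_space.match('D E ').group())
```

Either `[*\s]` or `(?:\*|\s)` expresses "asterisk or whitespace"; the character class is the shorter form when every alternative is a single character.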