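Multiple regex matches on one line not working

Time: 2014-10-01 17:26:26

Tags: python regex

I have some HTML from which I want to extract blocks of text that:

  • begin with # or | (the pipe symbol)
  • are followed by some text and a 'ticker' in parentheses
  • are followed by all the text up to the next match

Sample code: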

import re

text = """
#Test name 1 (ABCD) blah blah# some more text 1||Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah#  some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|
"""
expr = r'(?P<alltext>(#|\|)[^<>]+\((?P<ticker>[A-Z]{1,10})\)(?P<bodytext>.*))'
compiled_expr = re.compile(expr, re.MULTILINE)
# Iterate with the compiled pattern so the flags passed to re.compile apply
matches = compiled_expr.finditer(text)
for match in matches:
    d = match.groupdict()
    print(d['alltext'])

Sample output:

#Test name 1 (ABCD) blah blah# some more text 1||Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah#  some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|

This does not pick up both matches on the first line. What I need is for it to also detect 'Test name 2 ...'.

So my desired output is:

#Test name 1 (ABCD) blah blah# some more text 1|
|Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah#  some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|

1 answer:

Answer 0: (score: 2)

Use [#|][^#|]*?\(.*?\).*?(?=(?:[#|][^#|]*?\(.*?\))|$) with the single-line modifier (a.k.a. "dot matches all").
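
In Python, the single-line modifier corresponds to the re.DOTALL flag. Below is a minimal sketch that applies the pattern to the question's sample text; the names pattern and blocks are illustrative and do not come from the original post.

```python
import re

text = """
#Test name 1 (ABCD) blah blah# some more text 1||Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah#  some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|
"""

# re.DOTALL lets "." match newlines, so a block may run past a line break
pattern = re.compile(
    r'[#|][^#|]*?\(.*?\).*?(?=(?:[#|][^#|]*?\(.*?\))|$)',
    re.DOTALL,
)

# Most matches end with the newline that precedes the next block; strip it
blocks = [m.group(0).strip() for m in pattern.finditer(text)]
for block in blocks:
    print(block)
```

On this input the loop prints five blocks, with the first line split into the "Test name 1" block and the "Test name 2" block, as requested in the question.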

Explanation:

[#|] # match "#" or "|"
[^#|]*? # any text except "#" or "|", up until the next...
\( #..."("
.*? # any text enclosed in the parentheses
\) # and a closing parenthesis
.*? # finally, any text until the next match OR the end of the string.
(?=
    (?: # this is the same pattern as before.
        [#|]
        [^#|]*?
        \(
        .*?
        \)
     )
|
    $
)
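
The commented layout above can be kept in Python source by compiling with re.VERBOSE. The sketch below is one possible adaptation, not part of the original answer: it reintroduces the question's named groups (alltext, ticker, bodytext) and its [A-Z]{1,10} ticker restriction, which is an assumption about what the asker still wants. The names block_re and tickers are illustrative.

```python
import re

text = """
#Test name 1 (ABCD) blah blah# some more text 1||Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah#  some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|
"""

# In verbose mode, whitespace and "#" comments outside character classes
# are ignored, so "[#|]" still matches a literal "#" or "|".
block_re = re.compile(r'''
    (?P<alltext>
        [#|]                            # a block starts with "#" or "|"
        [^#|]*?                         # name text, up to the ticker
        \( (?P<ticker>[A-Z]{1,10}) \)   # ticker in parentheses (assumed format)
        (?P<bodytext>.*?)               # body, as short as possible
    )
    (?=                                 # stop right before...
        [#|] [^#|]*? \( [A-Z]{1,10} \)  # ...the start of the next block
      | $                               # ...or the end of the string
    )
''', re.VERBOSE | re.DOTALL)

tickers = []
for match in block_re.finditer(text):
    d = match.groupdict()
    tickers.append(d['ticker'])
    print(d['ticker'], repr(d['bodytext'].strip()))
```

The lookahead does not consume the next block's delimiter, which is why the following match can start on it.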