Question

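I have some HTML from which I want to extract blocks of text that:

start with # or | (the pipe symbol)
are followed by some text and a 'ticker' in parentheses
are followed by all text up until the next match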

Sample code:

import re

text = """
#Test name 1 (ABCD) blah blah# some more text 1||Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah#  some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|
"""
# Pattern from the question, unchanged: this is the expression that misbehaves.
expr = r'(?P<alltext>(#|\|)[^<>]+\((?P<ticker>[A-Z]{1,10})\)(?P<bodytext>.*))'
compiled_expr = re.compile(expr, re.MULTILINE)
# Use the compiled pattern so the flags passed to re.compile apply.
matches = compiled_expr.finditer(text)
for match in matches:
    d = match.groupdict()
    print(d['alltext'])

Sample output:

#Test name 1 (ABCD) blah blah# some more text 1||Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah#  some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|

This does not pick up both matches on the first line. I need it to detect 'Test name 2 ...' as a separate match.

So my desired output is:

#Test name 1 (ABCD) blah blah# some more text 1|
|Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah#  some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|

Answer 1

[#|][^#|]*?\(.*?\).*?(?=(?:[#|][^#|]*?\(.*?\))|$), with the single-line modifier (a.k.a. "dot matches all").

Demo.

Explanation:

[#|] # match "#" or "|"
[^#|]*? # any text except "#" or "|", up until the next...
\( #..."("
.*? # any text enclosed in the parentheses
\) # and a closing parenthesis
.*? # finally, any text until the next match OR the end of the string.
(?=
    (?: # this is the same pattern as before.
        [#|]
        [^#|]*?
        \(
        .*?
        \)
     )
|
    $
)
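
As a minimal sketch of how the pattern above might be applied in Python, assuming re.DOTALL as the "dot matches all" modifier: the function name extract_blocks and the named group ticker are illustrative additions, not part of the original answer, and surrounding whitespace is stripped from each block for readability.

```python
import re

# The answer's pattern, with a named group added around the ticker
# (the group name is an assumption made for this sketch).
PATTERN = re.compile(
    r"[#|][^#|]*?\((?P<ticker>.*?)\).*?(?=[#|][^#|]*?\(.*?\)|$)",
    re.DOTALL,  # "dot matches all": lets .*? run across line breaks
)

def extract_blocks(text):
    """Return a list of (ticker, block) tuples, one per match."""
    return [(m.group("ticker"), m.group(0).strip())
            for m in PATTERN.finditer(text)]

text = """
#Test name 1 (ABCD) blah blah# some more text 1||Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah#  some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|
"""

for ticker, block in extract_blocks(text):
    print(ticker, "->", block)
```

The lookahead only stops the body text where a "#" or "|" is followed by a parenthesized group with no other "#" or "|" in between, which is why the stray delimiters inside the body text do not start a new block.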
