I have some HTML from which I want to extract blocks of text that begin with # or | (the pipe symbol).
Sample code:
import re

text = """
#Test name 1 (ABCD) blah blah# some more text 1||Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah# some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|
"""
expr = r'(?P<alltext>(#|\|)[^<>]+\((?P<ticker>[A-Z]{1,10})\)(?P<bodytext>.*))'
compiled_expr = re.compile(expr, re.MULTILINE)
# Search with the compiled pattern so the flags passed above are actually used
matches = compiled_expr.finditer(text)
for match in matches:
    d = match.groupdict()
    print(d['alltext'])
Sample output:
#Test name 1 (ABCD) blah blah# some more text 1||Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah# some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|
This does not pick up both matches on the first line. I need it to detect the "Test name 2 ..." block as well.
So the output I want is:
#Test name 1 (ABCD) blah blah# some more text 1|
|Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah# some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|
Answer 0 (score: 2)
Use the following pattern with the single-line modifier (also known as "dot matches all"):
[#|][^#|]*?\(.*?\).*?(?=(?:[#|][^#|]*?\(.*?\))|$)
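In Python, the single-line modifier corresponds to the re.DOTALL flag. The following is a minimal sketch of applying the pattern to the sample text from the question. Stripping the trailing newline from each match is my own addition, since a match runs right up to the start of the next block.

```python
import re

text = """
#Test name 1 (ABCD) blah blah# some more text 1||Test name 2 (EFGH) blah blah some more text 2
#Test name 3 (IJKL) blah blah# some more text 3
|Test name 4 (MNOP) blah blah||some more text 4
|Test name 5 (QRST) blah blah||some more text 5|
"""

# re.DOTALL makes "." match newlines too ("dot matches all").
pattern = re.compile(
    r'[#|][^#|]*?\(.*?\).*?(?=(?:[#|][^#|]*?\(.*?\))|$)',
    re.DOTALL,
)

# Each match extends to the start of the next block, so it may end
# with a newline; strip() removes it (an addition for readability).
blocks = [m.group(0).strip() for m in pattern.finditer(text)]

for block in blocks:
    print(block)
```

With the sample text this yields five blocks, with the first line split at the second | of the || pair.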
Explanation:
[#|]      # match "#" or "|"
[^#|]*?   # any text except "#" or "|", up until the next...
\(        # ..."("
.*?       # any text enclosed in the parentheses
\)        # and a closing parenthesis
.*?       # finally, any text until the next match OR the end of the string
(?=
  (?:     # this is the same pattern as before
    [#|]
    [^#|]*?
    \(
    .*?
    \)
  )
  |
  $
)
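The commented layout above can be kept in the code itself by compiling with re.VERBOSE in addition to re.DOTALL. The sketch below also adds named groups. The names ticker and bodytext are borrowed from the question's original pattern, while name is a hypothetical label of mine. None of these groups are part of the pattern as given above, and they do not change what is matched.

```python
import re

# re.VERBOSE ignores whitespace and "#" comments outside character
# classes; "#" inside [...] is still a literal character.
block_re = re.compile(r"""
    [#|]                    # match "#" or "|"
    (?P<name>[^#|]*?)       # text without "#" or "|" (hypothetical group name)
    \( (?P<ticker>.*?) \)   # text enclosed in parentheses
    (?P<bodytext>.*?)       # text until the next block or the end of the string
    (?=
        (?: [#|] [^#|]*? \( .*? \) )   # same pattern as before, as a lookahead
        | $
    )
""", re.VERBOSE | re.DOTALL)

# First line of the question's sample text.
line = ("#Test name 1 (ABCD) blah blah# some more text 1"
        "||Test name 2 (EFGH) blah blah some more text 2")

found = [(m.group("ticker"), m.group(0)) for m in block_re.finditer(line)]

for ticker, block in found:
    print(ticker, "->", block)
```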