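I have a large, multi-line text in Python.
I need to get the text between {{book and }}.
I tried using regular expressions.
The problem is that the text inside spans multiple lines.
I tried {{book (.+)
It only gives me the text on the first line.
I tried {{book (.+) }}
This gives an error:
re.search("{{book .*?}", pagetext).group()
I have tried all sorts of expressions... The question is how to make the regular expression continue onto the next line.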
lot of other text
{{book series
|name = Twilight
|image = [[File:The twilight saga hardback.jpg|260px|]]
|language = English<!-- Do not link, per WP: OVERLINK -->
|genre = [[Romance (novel)|Romance]], [[fantasy literature|fantasy]], [[young-adult fiction]]
|publisher = [[Little, Brown and Company]]
|pub_date = 2005–2008
|media_type = Print
}}
<lot of other text >
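Answer 0 (score: 1)
You need to use the re.DOTALL flag to allow . to match newlines. Also, you should escape the curly braces, since they have a special meaning (repetition counts) in Python's regular expression syntax.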
re.search(r"\{\{book .*?\}\}", pagetext, re.DOTALL)
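To make the effect of the flag concrete, here is a minimal sketch. The sample `pagetext` is a shortened, hypothetical version of the text in the question, and the capturing group `(.*?)` is an addition for illustration, not part of the original answer.

```python
import re

# Hypothetical, shortened sample: the template spans several lines.
pagetext = """lot of other text
{{book series
|name = Twilight
|language = English
|media_type = Print
}}
lot of other text"""

# Without re.DOTALL, "." does not match "\n", so the closing "}}"
# on a later line can never be reached and there is no match.
no_flag = re.search(r"\{\{book .*?\}\}", pagetext)
print(no_flag)

# With re.DOTALL, "." also matches newlines; the non-greedy ".*?"
# stops at the first "}}" it finds.
match = re.search(r"\{\{book (.*?)\}\}", pagetext, re.DOTALL)
if match:
    print(match.group())   # the whole template, including the braces
    print(match.group(1))  # only the text between "{{book " and "}}"
```

Checking the result of `re.search` before calling `.group()` avoids an `AttributeError` when the page contains no such template.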
Answer 1 (score: 0)
If nested {{expr}} can occur, a regular expression is not enough. For example:
import re

pagetext = "start {{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}} {{e}}nd"
# XXX doesn't work: the non-greedy match stops at the first "}}", so the text is truncated
print("Wrong: %r" % re.search(r"\{\{book .*?\}\}", pagetext, re.DOTALL).group())
# -> Wrong: '{{book with {{n{{e}}'
Adapting my answer to the question "get first paragraph from wikipedia article":
# extract everything from the first "{{book " to matching "}}"
prefix, sep, rest = pagetext.partition("{{book ")
if sep:  # found the first "{{"
    depth = 1
    prevc = None
    for i, c in enumerate(rest):
        if c == "{" and prevc == c:  # found "{{"
            depth += 1
            prevc = None  # match "{{{ " only once
        elif c == "}" and prevc == c:  # found "}}"
            depth -= 1
            if depth == 0:  # found matching "}}"
                pagetext = sep + rest[:i+1]  # include "}}"
                break
            prevc = None  # match "}}} " only once
        else:
            prevc = c
print(pagetext)
{{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}}
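The same depth-counting loop can be wrapped in a function so that it does not overwrite `pagetext` and so that the "not found" case is explicit. This is only a sketch: the name `extract_template` and its `start` parameter are hypothetical, and returning `None` for unbalanced braces is an assumption (the loop above leaves `pagetext` unchanged in that case).

```python
def extract_template(text, start="{{book "):
    """Return the first template beginning with `start`, up to its
    matching "}}", or None if it is absent or never closed."""
    prefix, sep, rest = text.partition(start)
    if not sep:
        return None  # start marker not found
    depth = 1  # the "{{" inside `start` is already open
    prevc = None
    for i, c in enumerate(rest):
        if c == "{" and prevc == c:  # found "{{"
            depth += 1
            prevc = None  # count "{{{" only once
        elif c == "}" and prevc == c:  # found "}}"
            depth -= 1
            if depth == 0:  # this "}}" closes the outer template
                return sep + rest[:i + 1]
            prevc = None  # count "}}}" only once
        else:
            prevc = c
    return None  # reached the end with unbalanced braces


sample = "start {{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}} {{e}}nd"
print(extract_template(sample))
```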