Question

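In Python I have a large text that spans multiple lines. I need to get the text between {{book and }}. I tried using a regular expression, but the problem is that the text inside spans several lines. I tried {{book (.+) and it only gives me the text on the first line. I tried {{book (.+) }} and that raises an error.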

re.search("{{book .*?}", pagetext).group()

I have tried all sorts of expressions. The question is how to make the regular expression continue onto the next line.

lot of other text
{{book series
|name = Twilight
|image = [[File:The twilight saga hardback.jpg|260px|]]
|language = English<!-- Do not link, per WP: OVERLINK -->
|genre = [[Romance (novel)|Romance]], [[fantasy literature|fantasy]], [[young-adult fiction]]
|publisher = [[Little, Brown and Company]]
|pub_date = 2005–2008
|media_type = Print
}}
<lot of other text >

Answer 1

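You need to use the re.DOTALL flag so that . also matches newline characters. You should also escape the braces, because they are special characters in Python regular expression syntax.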

re.search(r"\{\{book .*?\}\}", pagetext, re.DOTALL)
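As a minimal sketch of how this might be used on multi-line input: the sample text below is a shortened, hypothetical stand-in for the real page, and the capture group (which is not part of the answer's pattern) is added here only to pull out the text between the delimiters.

```python
import re

# Hypothetical, shortened sample standing in for the real page text
pagetext = """lot of other text
{{book series
|name = Twilight
|language = English
|media_type = Print
}}
lot of other text"""

# re.DOTALL lets "." match newlines; the non-greedy ".*?" stops at the first "}}"
match = re.search(r"\{\{book (.*?)\}\}", pagetext, re.DOTALL)
if match:
    body = match.group(1)  # text between "{{book " and "}}"
    print(body)
```

Without re.DOTALL the same pattern finds no match here, because the closing }} is on a different line from {{book.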

Answer 2

If nested {{expr}} blocks may be present, then a regular expression is not enough, for example:

import re

pagetext = "start {{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}} {{e}}nd"
# XXX doesn't work: the match stops at the first "}}", so the text is truncated
print("Wrong:  %r" % re.search(r"\{\{book .*?\}\}", pagetext, re.DOTALL).group())
# -> Wrong:  '{{book with {{n{{e}}'

Adapting my answer to the question "get first paragraph from wikipedia article":

# extract everything from the first "{{book " to matching "}}"
prefix, sep, rest = pagetext.partition("{{book ")
if sep:  # found the first "{{book "
    depth = 1
    prevc = None
    for i, c in enumerate(rest):
        if c == "{" and  prevc == c:  # found "{{"
            depth += 1
            prevc = None # match "{{{ " only once
        elif c == "}" and prevc == c: # found "}}"
            depth -= 1
            if depth == 0: # found matching "}}"
                pagetext = sep + rest[:i+1] # include "}}"
                break
            prevc = None # match "}}} " only once
        else:
            prevc = c
print(pagetext)

Output

{{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}}
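
The same depth-counting idea could also be wrapped in a reusable function. The following is only a sketch of one possible variation, not the answer's own code: the name extract_template and its start parameter are hypothetical, and it scans the "{{" and "}}" tokens with re.finditer rather than walking the text character by character.

```python
import re

def extract_template(text, start="{{book "):
    """Return the text from the first `start` to its matching "}}", or None."""
    begin = text.find(start)
    if begin == -1:
        return None  # opening marker not found
    depth = 0
    # Visit only the "{{" and "}}" tokens, beginning with the opening marker
    for token in re.finditer(r"\{\{|\}\}", text[begin:]):
        depth += 1 if token.group() == "{{" else -1
        if depth == 0:  # this "}}" closes the opening "{{book "
            return text[begin:begin + token.end()]
    return None  # braces are unbalanced

pagetext = "start {{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}} {{e}}nd"
print(extract_template(pagetext))
```

Returning None for a missing or unbalanced template, instead of printing the whole page, makes the failure case easier to detect in calling code.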
