Need to extract multi-line text from a large block of text in Python

Time: 2011-11-13 13:01:55

Tags: python regex

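I have a large multi-line text in Python and need to get the text between {{book and }}. I tried using a regular expression, but the text inside spans several lines. I tried {{book (.+) and it only gives me the text on the first line. I tried {{book (.+) }} and that gives an error.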

re.search("{{book .*?}", pagetext).group()

I have tried all sorts of expressions. The question is how to make a regular expression continue onto the next line.

lot of other text
{{book series
|name = Twilight
|image = [[File:The twilight saga hardback.jpg|260px|]]
|language = English<!-- Do not link, per WP: OVERLINK -->
|genre = [[Romance (novel)|Romance]], [[fantasy literature|fantasy]], [[young-adult fiction]]
|publisher = [[Little, Brown and Company]]
|pub_date = 2005–2008
|media_type = Print
}}
<lot of other text >

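2 answers:

Answer 0: (score: 1)

You need to use the re.DOTALL flag so that . also matches newline characters. You should also escape the braces, since they are special characters in Python's regular expression syntax.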

re.search(r"\{\{book .*?\}\}", pagetext, re.DOTALL)
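As a rough illustration of the difference the flag makes, here is a minimal sketch. The `sample` string is a made-up, shortened stand-in for the page text, and the capturing group is added only to show how to get the body without the surrounding braces.

```python
import re

# Hypothetical sample: a template spread over several lines.
sample = """intro text
{{book series
|name = Twilight
|language = English
}}
trailing text"""

pattern = r"\{\{book (.*?)\}\}"

# Without re.DOTALL, "." stops at the first newline, so the closing
# "}}" on a later line is never reached and there is no match.
no_flag = re.search(pattern, sample)

# With re.DOTALL, "." also matches "\n", so the match can span lines.
with_flag = re.search(pattern, sample, re.DOTALL)

print(no_flag)             # None
print(with_flag.group(1))  # body of the template, without the braces
```

Since re.search returns None when nothing matches, calling .group() directly on the result raises AttributeError in that case, so it is safer to check the result first.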

Answer 1: (score: 0)

If nested {{expr}} can occur, a regular expression is not enough. For example:

import re

pagetext = "start {{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}} {{e}}nd"
# XXX doesn't work: the non-greedy match stops at the first "}}", so the text is truncated
print("Wrong:  %r" % re.search(r"\{\{book .*?\}\}", pagetext, re.DOTALL).group())
# -> Wrong:  '{{book with {{n{{e}}'

Adapted from my answer to the question "get first paragraph from wikipedia article":

# extract everything from the first "{{book " to matching "}}"
prefix, sep, rest = pagetext.partition("{{book ")
if sep: # found the first "{{"
    depth = 1
    prevc = None
    for i, c in enumerate(rest):
        if c == "{" and  prevc == c:  # found "{{"
            depth += 1
            prevc = None # match "{{{ " only once
        elif c == "}" and prevc == c: # found "}}"
            depth -= 1
            if depth == 0: # found matching "}}"
                pagetext = sep + rest[:i+1] # include "}}"
                break
            prevc = None # match "}}} " only once
        else:
            prevc = c
print(pagetext)

Output

{{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}}
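The same depth-counting idea could also be wrapped in a reusable function. The following is only a sketch under a few assumptions: the name `extract_template` and its `start` parameter are hypothetical, it scans two characters at a time instead of tracking the previous character, and it returns None when there is no "{{book " or the braces never balance.

```python
def extract_template(text, start="{{book "):
    """Return the first balanced "{{book ...}}" span in text, or None."""
    begin = text.find(start)
    if begin == -1:
        return None  # no opening marker at all
    depth = 0
    i = begin
    # Stop one character early so text[i:i + 2] is always a full pair.
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2  # consume both braces so "{{{" counts only once
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return text[begin:i]  # include the closing "}}"
        else:
            i += 1
    return None  # braces never balanced


nested = "start {{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}} {{e}}nd"
print(extract_template(nested))
# -> {{book with {{n{{e}}st{{e}}d t{{e}}xt}} t{{e}}xt}}
```

Returning None rather than the unmodified input makes the "not found" case explicit for the caller.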