How to replace multiple regex matches in a block of text

Time: 2018-12-20 21:54:51

Tags: python regex

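I'm trying to convert multiple matches in a paragraph into links while preserving the surrounding text in the final output. The pattern I'm matching is reminiscent of Markdown's hyperlink syntax; it gives non-technical users a way to mark which text they want linked in their input (a Google Sheet that I'm accessing via the Sheets API / Python). The first group I capture is the link text, and the second is the value for a key in the query string.

I've been able to match a single instance of the pattern successfully, but my replacement string replaces the entire paragraph in the output.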

text = (
    "2018 was a big year for my sourdough starter and me. Mostly "
    "we worked on developing this [tangy bread](19928) and these [chewy "
    "rolls](9843). But we were also just content keeping each other "
    "company and inspired to bake."
)

import re

def link_inline(text):
    # expand a proper link around recipe id
    ref = re.search(r"(\[.*?\]\(\d+\))", text, re.MULTILINE).group(1)
    if len(ref) > 0:
        link = re.sub(r"\[(.*?)\]\((\d+)\)", r"<a href='https://www.foo.com/recipes?rid=\2'>\1</a>", ref)
        return text
    else:
        return "replacement failed"

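The goal is for the output to keep the paragraph intact and simply replace each match of the pattern \[(.*?)\]\((\d+)\) with the following string, including the backreferences to the groups: <a href="https://www.foo.com?bar=\2">\1</a>

So it needs to iterate over the text to replace every match (presumably with re.finditer?) while also preserving the original text outside the matches. But I'm not sure how to define the loop properly and perform the replacement without simply overwriting the whole paragraph with the replacement string.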

1 answer:

Answer 0: (score: 0)

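I used re.compile, and instead of putting parentheses around the whole pattern, I put one pair around .*? and another around \d+, since those two parts are the text we want to extract and put into our URL. Calling sub on the compiled pattern replaces every match in the text and leaves everything outside the matches untouched, so no explicit loop is needed.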
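To see what the two groups actually capture, here is a minimal sketch using findall on a shortened sample sentence. The names pattern, sample and pairs are only illustrative and not part of the original code.

```python
import re

# Same pattern as in the answer: group 1 is the link text, group 2 is the numeric id.
pattern = re.compile(r"\[(.*?)\]\((\d+)\)")

sample = "this [tangy bread](19928) and these [chewy rolls](9843)"

# With two capturing groups, findall returns one (text, id) tuple per match.
pairs = pattern.findall(sample)
print(pairs)
# → [('tangy bread', '19928'), ('chewy rolls', '9843')]
```

These are the same values that \1 and \2 refer to in the replacement string.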

import re

def link_inline(text):
    # expand a proper link around recipe id
    ref = re.compile(r"\[(.*?)\]\((\d+)\)")  # raw string avoids invalid-escape warnings
    replacer = r'<a href="https://www.foo.com/recipes?rid=\2">\1</a>'
    return ref.sub(replacer, text)


text = """
2018 was a big year for my sourdough starter and me. Mostly we worked on
 developing this [tangy bread](19928) and
 these [chewy rolls](9843). But we were also just
 content keeping each other company and inspired to bake.
"""

print(link_inline(text))

Output:

2018 was a big year for my sourdough starter and me. Mostly we worked on
 developing this <a href="https://www.foo.com/recipes?rid=19928">tangy bread</a> and
 these <a href="https://www.foo.com/recipes?rid=9843">chewy rolls</a>. But we were also just
 content keeping each other company and inspired to bake.
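For comparison, the re.finditer loop that the question guessed at could look roughly like the sketch below. The helper name link_inline_finditer is hypothetical; it copies the text between matches unchanged and should give the same result as sub, which is why sub is the simpler choice here.

```python
import re

pattern = re.compile(r"\[(.*?)\]\((\d+)\)")

def link_inline_finditer(text):
    # Hypothetical helper: rebuild the text piece by piece instead of using sub().
    parts = []
    last_end = 0
    for match in pattern.finditer(text):
        # Copy the untouched text between the previous match and this one.
        parts.append(text[last_end:match.start()])
        label, rid = match.groups()
        parts.append(f'<a href="https://www.foo.com/recipes?rid={rid}">{label}</a>')
        last_end = match.end()
    # Copy whatever follows the last match.
    parts.append(text[last_end:])
    return "".join(parts)

sample = "this [tangy bread](19928) and these [chewy rolls](9843)."
print(link_inline_finditer(sample))
```

A loop like this only pays off when each replacement needs extra logic, for example looking the id up somewhere before building the link.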

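As a sanity check, I tried inserting a few extra bits with parentheses and square brackets that aren't links, such as (this) here and [this] here, into the string text. Everything still worked fine.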
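That check can be written down as a small sketch. One caveat I'm fairly sure of: because .*? can run past a closing bracket, a stray [this] that sits before a real link on the same line gets swallowed into the link text. The STRICT pattern below is my own assumed variant (no ] allowed inside the link text), not something from the original answer.

```python
import re

LINK = re.compile(r"\[(.*?)\]\((\d+)\)")        # pattern from the answer
STRICT = re.compile(r"\[([^\]]*)\]\((\d+)\)")   # assumed variant: no "]" inside the link text
REPL = r'<a href="https://www.foo.com/recipes?rid=\2">\1</a>'

# Lone brackets and parentheses with no link after them are left alone.
untouched = "Some (this) here and [this] here."
assert LINK.sub(REPL, untouched) == untouched

# A stray "[this]" before a real link on the same line is the tricky case.
tricky = "[this] here, then [tangy bread](19928)."
loose = LINK.sub(REPL, tricky)
strict = STRICT.sub(REPL, tricky)
print(loose)   # link text starts at the first "[" and swallows "this] here, then [tangy bread"
print(strict)  # only "tangy bread" becomes the link text
```

If users are likely to type square brackets for other reasons, the stricter character class is probably the safer default.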