Question

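I have an RSS indexing application written in Python that stores RSS content as a blob in a database. When the application first processed the article content, it commented out every link that didn't match certain criteria, for example: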

<a href="http://google.com">Google</a>

became:

<!--<a href="http://google.com">Google</a>--> Google

Now I need to go back through all of those old articles and modify the links. Using BeautifulSoup 4, I can easily find the comments with:

import re
from bs4 import Comment

links = soup.findAll(text=lambda text:isinstance(text, Comment))
for link in links:
    text = re.sub('<[^>]*>', '', link.string)
    # any html in the link tag was escaped by BS4, so need to convert back
    text = text.replace('&amp;lt;','<')
    text = text.replace('&amp;gt;','>')
    find = link.string + " " + text

The value of "find" above is:

<!--<a href="http://google.com">Google</a>--> Google

This makes it easy to run a .replace() on the content.

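The problem I'm running into now (and I'm sure it's a simple one) is multi-line find/replace. When Beautiful Soup originally commented out the links, some of them were converted to: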

<!--<a href="http://google.com">Google
</a>--> Google

or

<!--<a href="http://google.com">Google</a>--> 
Google

Obviously replace(old, new) doesn't work here, because the search string passed to replace() doesn't contain the line breaks.

Can someone help me with a multi-line regex find/replace? It should be case-sensitive.

Answer 1

Try this:

 re.sub(r'pattern', '', link, flags=re.MULTILINE)

By default, regular expression matching is case-sensitive.
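One thing to keep in mind: re.MULTILINE only changes how ^ and $ behave, and re.DOTALL only changes what . matches, so the pattern itself still has to allow for the stray line breaks. Below is a minimal sketch of one way to do that, assuming the "find" string from the question is available as a plain string. The helper name loose_pattern and the sample strings are made up for illustration and are not part of any library. It escapes the literal text and lets whitespace (including newlines) appear between tags and words:

```python
import re

def loose_pattern(literal):
    """Compile a case-sensitive regex that matches `literal` while tolerating
    extra whitespace (including newlines) between its tags and words.
    Hypothetical helper, shown only as a sketch."""
    # Split the literal into tag-like chunks, whitespace runs, and text runs.
    tokens = re.findall(r'<[^>]*>|\s+|[^<\s]+|<', literal)
    # Whitespace in the literal must match at least one whitespace character;
    # everything else is escaped so it is matched literally.
    parts = [r'\s+' if tok.isspace() else re.escape(tok) for tok in tokens]
    # Optional whitespace is allowed between any two neighbouring tokens.
    return re.compile(r'\s*'.join(parts))

find = '<!--<a href="http://google.com">Google</a>--> Google'
replacement = '<a href="http://google.com">Google</a>'

content = 'before <!--<a href="http://google.com">Google\n</a>--> Google after'
# A function is used as the replacement so that backslashes in it are not
# interpreted as regex escapes.
fixed = loose_pattern(find).sub(lambda m: replacement, content)
print(fixed)
```

No re.IGNORECASE flag is passed, so the match stays case-sensitive. Note that this sketch also accepts the words of the link text run together with no whitespace at all, which may or may not matter for your data.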

If the RSS content turns out to be irregular for some reason, your script will fail. In that case you should consider using a proper parser, such as lxml.
