Question

I am using BeautifulSoup to process HTML. When I call replace_with(), I get the result below: it escapes my '<' and '>'.

>>> tt = bs('<p><a></a></p>')
>>> bb = tt.p
>>> tt
<html><body><p><a></a></p></body></html>
>>> bb
<p><a></a></p>
>>> bb.replace_with('<p>aaaaaaa<aaaaa></p>')
<p><a></a></p>
>>> tt
<html><body>&lt;p&gt;aaaaaaa&lt;aaaaa&gt;&lt;/p&gt;</body></html>

I want the output to look like this:

>>> tt
<html><body><p>aaaaaaa<aaaaa></p></body></html>

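What should I do? Thanks.

--------- Update ---------
I am writing a Python program that converts an HTML blog into Markdown. Its code is here. My main approach is:
1. Fetch the page source with urllib2
2. Parse the DOM tree with BeautifulSoup
3. Modify the existing DOM tree with BeautifulSoup (this is where I use bs.replace_with)
4. Save the modified DOM tree to a Markdown file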

The problem is that BeautifulSoup automatically escapes '<' and '>' when I modify the DOM tree, so the modified tree is not what I expect. The HTML is

 service tool->SQL Server Reporting Services

The Markdown is

 service tool-&gt;SQL Server Reporting Services

Answer 1

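from bs4 import BeautifulSoup
tt = BeautifulSoup('<p><a></a></p>')

# Parse the replacement markup first, so replace_with() receives a Tag
# instead of a plain string (plain strings are inserted as escaped text).
new = BeautifulSoup('<p>aaaaaaa<aaaaa></p>')
tt.p.replace_with(new.p)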
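
To make the difference concrete, here is a minimal side-by-side sketch. The explicit 'html.parser' argument and the <b> element are my own assumptions for illustration, not something taken from the question; with lxml the output would additionally be wrapped in <html><body>.

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so the sketch needs no extra parser package.
soup_text = BeautifulSoup('<p><a></a></p>', 'html.parser')
soup_tag = BeautifulSoup('<p><a></a></p>', 'html.parser')

# A plain string becomes a text node, so '<' and '>' are escaped on output.
soup_text.p.replace_with('<p>aaaaaaa<b>bold</b></p>')
escaped = str(soup_text)

# A parsed Tag is inserted as real markup and stays navigable.
replacement = BeautifulSoup('<p>aaaaaaa<b>bold</b></p>', 'html.parser')
soup_tag.p.replace_with(replacement.p)
as_markup = str(soup_tag)
```

After the second replacement, soup_tag.b is a normal tag again, which is probably what you want if you keep modifying the tree afterwards.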

With your own code, you can use an output formatter to get the output you want:

from bs4 import BeautifulSoup
tt = BeautifulSoup('<p><a></a></p>')
tt.p.replace_with('<p>aaaaaaa<aaaaa></p>')
print(tt.prettify(formatter=None))
<html>
 <body>
  <p>aaaaaaa<aaaaa></p>
 </body>
</html>
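
One caveat, shown as a rough sketch below: as far as I can tell, formatter=None only changes how the tree is serialized. The inserted string is still a single text node, so searching for the new <p> finds nothing. The 'html.parser' argument and the variable names are my assumptions.

```python
from bs4 import BeautifulSoup

# Assumption: html.parser, so no <html><body> wrapper is added.
soup = BeautifulSoup('<p><a></a></p>', 'html.parser')
soup.p.replace_with('<p>aaaaaaa<aaaaa></p>')

# The default formatter escapes '<' and '>' found in text nodes.
default_output = soup.decode()

# formatter=None turns entity substitution off at output time only.
raw_output = soup.decode(formatter=None)

# The tree itself holds no <p> tag, just one string.
still_text_node = soup.find('p') is None
```

So this approach is fine if you only need the final text, but not if later steps have to navigate the replaced markup.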

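You can also replace the string inside a tag. I am not entirely sure what you are trying to achieve, but the documentation is very clear and easy to follow.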
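
For the string-replacement route, a small sketch using the text from your update might look like this. The starting markup '<p>service tool</p>' is invented for the example, and 'html.parser' is again my assumption. Note that get_text() returns the unescaped text, which may be what you want to write into the Markdown file.

```python
from bs4 import BeautifulSoup

# Assumption: a made-up one-paragraph document stands in for the blog page.
soup = BeautifulSoup('<p>service tool</p>', 'html.parser')

# Replace only the text node inside <p>; the tag itself is kept.
soup.p.string.replace_with('service tool->SQL Server Reporting Services')

# Serialized HTML escapes '>' as an entity...
html_output = str(soup)

# ...but the text content is returned as-is, with no entities.
markdown_text = soup.p.get_text()
```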

How do I stop replace_with from escaping my '<' and '>'?
