Python bs4 removes br tags

Date: 2016-04-24 09:27:21

Tags: python beautifulsoup html-parsing bs4

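I am using bs4 to manipulate some rich text, but it removes the br tags inside the elements where I do character conversion. Below is a simplified form of the code.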

import re
from bs4 import BeautifulSoup

#source_code = self.textInput.toHtml()
source_code =  """.......<p style=" margin-top:12px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><span style=" font-family:'Ubuntu';">ABC ABC<br />ABC</span></p>.......""" 

soup = BeautifulSoup(source_code, "lxml")

for elm in soup.find_all('span', style=re.compile(r"font-family:'Ubuntu'")):
    # actually there was a for loop here
    elm.string = elm.text.replace("A", "X")
    elm.string = elm.text.replace("B", "Y")
    elm.string = elm.text.replace("C", "Z")

print(soup.prettify())

This should output:

...<span style=" font-family:'Ubuntu';">XYZ XYZ<br />XYZ</span>...
#XYZ XYZ
#XYZ

But the output it actually produces has no br tag:

...<span style=" font-family:'Ubuntu';">XYZ XYZXYZ</span>...
#XYZ XYZXYZ

How can I fix this?

1 answer:

Answer 0: (score: 2)

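The problem is that you are redefining the element's .string, which replaces all of the element's children (including the br tag) with a single string. Instead, I would find the "text" nodes and make the replacements there: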

for text in elm.find_all(text=True):
    text.replace_with(text.replace("A", "X").replace("B", "Y").replace("C", "Z"))

This works for me, producing:

</p>
  <p style=" margin-top:12px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">
   <span style=" font-family:'Ubuntu';">
    XYZ XYZ
    <br/>
    XYZ
   </span>
</p>
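
To make the difference between the two approaches concrete, here is a minimal sketch that runs both on the same span and checks whether the br tag survives. It assumes the built-in "html.parser" backend instead of "lxml" so that nothing beyond bs4 needs to be installed, and it uses find_all(string=True), the newer spelling of text=True in recent bs4 releases. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<span>ABC ABC<br />ABC</span>"

# Approach 1: assign to .string.
# This swaps ALL children of the span (both text nodes and the <br/>)
# for one new string, so the <br/> is lost.
soup_assign = BeautifulSoup(html, "html.parser")
span_assign = soup_assign.span
span_assign.string = span_assign.text.replace("A", "X")
br_kept_by_assign = span_assign.find("br") is not None

# Approach 2: replace each text node in place.
# Only the text nodes are touched; sibling tags such as <br/> stay put.
soup_nodes = BeautifulSoup(html, "html.parser")
span_nodes = soup_nodes.span
for node in span_nodes.find_all(string=True):
    node.replace_with(node.replace("A", "X"))
br_kept_by_nodes = span_nodes.find("br") is not None

print(br_kept_by_assign, span_assign)
print(br_kept_by_nodes, span_nodes)
```

The first print should report that the br tag is gone, the second that it is still there.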
  

How can I include this part in a loop?

Here is an example:

replacements = {
    "A": "X",
    "B": "Y",
    "C": "Z"
}
for text in elm.find_all(text=True):
    text_to_replace = text
    for k, v in replacements.items():
        text_to_replace = text_to_replace.replace(k, v)

    text.replace_with(text_to_replace)
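
For reference, the snippet above can be combined with the span lookup from the question into one self-contained script. This is only a sketch under a few assumptions: the helper name replace_in_text_nodes is hypothetical, the input is a shortened version of the question's markup, and "html.parser" stands in for "lxml" so the example has no extra dependency.

```python
import re
from bs4 import BeautifulSoup


def replace_in_text_nodes(element, replacements):
    """Apply every replacement to each text node under element, leaving child tags intact."""
    # find_all returns a list, so replacing nodes while looping over it is safe.
    for node in element.find_all(string=True):
        new_text = str(node)
        for old, new in replacements.items():
            new_text = new_text.replace(old, new)
        node.replace_with(new_text)


# Shortened stand-in for the HTML in the question (assumption).
source_code = """<p><span style=" font-family:'Ubuntu';">ABC ABC<br />ABC</span></p>"""

soup = BeautifulSoup(source_code, "html.parser")
for elm in soup.find_all("span", style=re.compile(r"font-family:'Ubuntu'")):
    replace_in_text_nodes(elm, {"A": "X", "B": "Y", "C": "Z"})

result = str(soup)
print(result)
```

Note that the replacements are applied one after another, so a mapping whose outputs overlap its inputs (for example "A" to "B" together with "B" to "C") would chain. For single-character mappings like this one, str.translate with str.maketrans avoids that.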