Question

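I'm processing HTML with Python and the BeautifulSoup 4 library, and I can't find an obvious way to replace &nbsp; with a space. Instead, it seems to be converted to a Unicode non-breaking space character.

Am I missing something obvious? What is the best way to replace &nbsp; with a normal space using BeautifulSoup?

Edit to add: I'm using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIES option from Beautiful Soup 3 isn't available.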

Answer 1

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n  <div>\n   a b\n  </div>\n </body>\n</html>'

Answer 2

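See Entities in the documentation. BeautifulSoup 4 produces proper Unicode for all entities:

An incoming HTML or XML entity is always converted into the corresponding Unicode character.

So yes, &nbsp; is turned into a non-breaking space character. If you really want those to be regular space characters, you'll have to do a Unicode replace.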
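
As a minimal sketch of that replace step, assuming the text has already been pulled out of the parsed document as a plain str (the sample strings and the names text, cleaned and SPACE_LIKE are made up for illustration):

```python
# Text as it might look after parsing: &nbsp; has become U+00A0.
text = 'a\xa0b\xa0c'

# Simple case: swap every non-breaking space for a regular space.
cleaned = text.replace('\xa0', ' ')
print(repr(cleaned))

# Several characters at once: str.translate with a code-point mapping.
# U+202F (narrow no-break space) is included only as an illustration.
SPACE_LIKE = {0x00A0: ' ', 0x202F: ' '}
print(repr('a\xa0b\u202fc'.translate(SPACE_LIKE)))
```

str.replace is enough when only U+00A0 matters; the translate table is easier to extend if other space-like characters turn up in your data.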

Answer 3

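I would simply replace the non-breaking space using its Unicode escape.

nonBreakSpace = u'\xa0'
# soup must be a plain string here (for example str(soup)), not a BeautifulSoup object.
# Use ' ' instead of '' to substitute a regular space rather than delete the character.
soup = soup.replace(nonBreakSpace, '')

One benefit is that, even though you are using BeautifulSoup, you don't need it for this.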

Answer 4

I had a problem with JSON output that soup.prettify() didn't fix, but it worked with unicodedata.normalize():

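import json
import unicodedata

from bs4 import BeautifulSoup

# r is assumed to be an HTTP response fetched earlier (e.g. with requests.get)
soup = BeautifulSoup(r.text, 'html.parser')
dat = soup.find('span', attrs={'class': 'date'})
print(f"date prints fine:'{dat.text}'")
# json.dumps escapes the non-breaking space as \u00a0
print(f"json:{json.dumps(dat.text)}")
# NFKD maps U+00A0 to a regular space
mydate = unicodedata.normalize("NFKD", dat.text)
print(f"json after normalizing:'{json.dumps(mydate)}'")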
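
The same effect can be reproduced without fetching a page. This is only a sketch: the string raw is a made-up stand-in for dat.text, and the names before and after are illustrative.

```python
import json
import unicodedata

# Hypothetical stand-in for dat.text: a date containing U+00A0.
raw = '12\xa0Jan\xa02021'

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
before = json.dumps(raw)

# NFKD replaces U+00A0 with its compatibility equivalent, a plain space.
after = json.dumps(unicodedata.normalize('NFKD', raw))

print(before)
print(after)
```

The first line printed contains the \u00a0 escapes; the second contains ordinary spaces.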

Answer 5

Admittedly, this doesn't use BeautifulSoup, but a more direct solution today may be some combination of html.unescape and unicodedata.normalize, depending on your data and what you want to do.

>>> from html import unescape
>>> s = unescape('An enthusiastic member of the&nbsp;community')  # using the import here
>>> s
'An enthusiastic member of the\xa0community'
>>> import unicodedata
>>> s = unicodedata.normalize('NFKC', s)
>>> s
'An enthusiastic member of the community'
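
Wrapped into one helper, that combination might look like the sketch below. The function name clean_text is hypothetical. Keep in mind that NFKC rewrites every compatibility character, not just the non-breaking space, so a targeted replace may be the safer choice when characters such as superscripts have to survive.

```python
import unicodedata
from html import unescape


def clean_text(markup_text: str) -> str:
    """Decode HTML entities, then fold compatibility characters with NFKC."""
    return unicodedata.normalize('NFKC', unescape(markup_text))


print(repr(clean_text('the&nbsp;community')))

# NFKC also changes other characters: here the superscript two (U+00B2)
# becomes a plain "2", which may or may not be what you want.
print(repr(clean_text('x&sup2;')))

# A targeted replace touches only the non-breaking space.
print(repr(unescape('x&sup2;&nbsp;y').replace('\xa0', ' ')))
```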

How can I replace or remove HTML entities like "&nbsp;" using BeautifulSoup 4

5 answers: