Question

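I'm using bs4 to do some processing on text, but in some cases it converts a non-breaking space (&nbsp;) into Â. As far as I can tell, this is an encoding mismatch from UTF-8 to latin1 (or the reverse?).

Everything in my web application is UTF-8, Python 3 is UTF-8, and I have confirmed that the database is UTF-8.

I have narrowed the problem down to this line: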

print("Before soup: " + text)  # Before soup: &nbsp;
soup = BeautifulSoup(text, "html.parser")
#.... do stuff to soup, but all commented out for this testing.
soup = BeautifulSoup(soup.renderContents(), "html.parser")  # <---- PROBLEM!
print(soup.renderContents())  # b'\xc3\x82\xc2\xa0'
print("After SOUP: " + str(soup))  # After SOUP: Â

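How do I stop renderContents() from changing the encoding? This function has no documentation!

Edit: after digging further into the documentation, this seems to be the key, but I still can't solve the problem!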

print(soup.prettify(formatter="html"))  # &Acirc;&nbsp;

Answer 1

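OK, apparently I didn't read the documentation carefully enough. The answer can be found there.

From https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings:

The problem is that the snippet passed to BS is too short, so Unicode, Dammit (BeautifulSoup's encoding-detection sub-library) doesn't have enough information to guess the encoding correctly.

Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. You can avoid mistakes and delays by passing the encoding to the BeautifulSoup constructor as from_encoding.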
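
To see why a wrong guess yields exactly the bytes shown in the question, the mismatch can be reproduced with plain Python and no BeautifulSoup at all. This is only a sketch: it assumes the mistaken guess was a Latin-1-style encoding such as windows-1252, which the post itself does not confirm.

```python
# "&nbsp;" is the character U+00A0; in UTF-8 it is encoded as two bytes.
nbsp_utf8 = "\u00a0".encode("utf-8")
print(nbsp_utf8)  # b'\xc2\xa0'

# Assumption: the detector wrongly treats those bytes as windows-1252,
# so each byte becomes its own character: 0xC2 -> 'Â', 0xA0 -> U+00A0.
misread = nbsp_utf8.decode("windows-1252")
print(repr(misread))  # 'Â\xa0'

# Rendering that two-character string back to UTF-8 gives four bytes,
# matching the renderContents() output in the question.
print(misread.encode("utf-8"))  # b'\xc3\x82\xc2\xa0'
```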

So the key is to add from_encoding="UTF-8" every time you construct a BS object from bytes:

soup = BeautifulSoup(soup.renderContents(), "html.parser", from_encoding="UTF-8")
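
A slightly fuller sketch of the same fix follows, assuming bs4 is installed. It uses encode_contents(), the current name for renderContents(). The second variant is not part of the original answer: it is an alternative that keeps the content as a str via decode_contents(), so the second parse has no bytes to guess an encoding for.

```python
from bs4 import BeautifulSoup

text = "&nbsp;"
soup = BeautifulSoup(text, "html.parser")

# Variant 1 (the answer's fix): round-trip through bytes, but state the
# encoding explicitly so Unicode, Dammit does not have to guess.
raw = soup.encode_contents()  # UTF-8 bytes by default
fixed = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# Variant 2 (an alternative, not from the answer): stay in str, so no
# encoding detection is involved in the second parse.
same = BeautifulSoup(soup.decode_contents(), "html.parser")

print(repr(str(fixed)))  # '\xa0'
print(repr(str(same)))   # '\xa0'
```

Whether the unfixed code actually misbehaves may depend on which character-detection library is installed alongside bs4, so the sketch only shows the two variants that avoid the guess.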
