Question

import urllib.request
from bs4 import BeautifulSoup

file_txt = urllib.request.urlopen("ftp://ftp.sec.gov/edgar/data/1220985/0000930413-12-003922.txt")
file = file_txt.read().decode('cp1252')
soup = BeautifulSoup(file)
print(soup.prettify())
# UnicodeEncodeError: 'charmap' codec can't encode character '\x92' in position 11900: character maps to <undefined>

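I checked the txt file. The \x92 is actually the HTML entity &#146;, which shows up as ’ when the file is displayed in a browser. Since I decode with the same encoding scheme the browser uses (cp1252), I'm not sure why the error occurs.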

Answer 1

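BeautifulSoup is usually good at detecting the encoding a web page uses, and it uses the chardet library to do so if that library is available. So I suggest installing the chardet package and letting BeautifulSoup work out the encoding.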

pip install chardet (or easy_install chardet)

Hope this helps.

Answer 2

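Beautiful Soup reads the document fine; the error occurs when you try to print it to the console. That usually means your console cannot display one of the characters. This page on the Python wiki may help.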
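To illustrate the point, here is a minimal sketch using only the standard library. It assumes the console encoding is cp1252, as the 'charmap' message in the traceback suggests. The helper name console_safe and the sample string are made up for the example and are not part of Beautiful Soup.

```python
import sys


def console_safe(text, encoding=None, errors="backslashreplace"):
    """Return text with characters the target encoding cannot represent escaped.

    `encoding` defaults to the console's encoding (falling back to UTF-8
    when sys.stdout has none, for example when output is redirected).
    """
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors=errors).decode(encoding)


# Hypothetical sample: U+0092 is the character named in the traceback.
# cp1252 has no byte for it, so a strict encode raises UnicodeEncodeError.
sample = "It\x92s a test"

try:
    sample.encode("cp1252")
except UnicodeEncodeError as exc:
    print("strict encoding fails:", exc.reason)

# The escaped version contains only characters cp1252 can represent.
print(console_safe(sample, encoding="cp1252"))
```

In the question's code this would be used as print(console_safe(soup.prettify())). Other options that avoid a helper, depending on your setup, are calling sys.stdout.reconfigure(errors="backslashreplace") on Python 3.7 or later, or setting the PYTHONIOENCODING environment variable to utf-8 before starting Python.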

Python BeautifulSoup decoding HTML entities
