Question

I am using Python with BeautifulSoup to scrape a web page, and I get this error:

'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>

This is my Python code:

hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried:  print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')

Answer 1

This problem usually occurs when we try to .encode() a byte string that is already encoded. So you could try decoding it first, as in:

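# Python 2: read() returns an encoded byte string
html = urllib.urlopen(link).read()
# Replace <source encoding> with the page's actual encoding
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")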
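
The snippet above is Python 2 (urllib.urlopen and implicit byte strings). As a minimal sketch of the same decode-then-encode step under Python 3, assuming the source encoding is already known; the helper name to_utf8 and the sample bytes are illustrative and not from the original post:

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes using a known source encoding, then re-encode as UTF-8."""
    text = raw.decode(source_encoding)   # bytes -> str
    return text.encode("utf-8")          # str -> UTF-8 bytes


# Hypothetical input: the single byte 0xAE, which windows-1252 maps to U+00AE.
sample = b"\xae"
utf8_bytes = to_utf8(sample, "windows-1252")
print(utf8_bytes)
```

In Python 3 the two types are distinct, so calling .encode() on bytes raises an AttributeError rather than triggering an implicit ASCII decode.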

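For example, in Python 2:

html = '\xae'
encoded_str = html.encode("utf8")

fails, because Python 2 first implicitly decodes the byte string as ASCII before it can encode it:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)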

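whereas:

html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®

succeeds without error. Note that "windows-1252" is just something I used as an example. I got it from chardet, which reported 0.5 confidence that it is correct! (Well, given a string only 1 character long, what do you expect?) You should change it to the actual encoding of the byte string returned by .urlopen().read(), as appropriate for the content you retrieve.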
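
When the encoding is not known in advance, one stdlib-only alternative to guessing with chardet is to try a short list of candidate encodings in order. The sketch below assumes Python 3; the function name decode_with_fallback and the candidate list are assumptions chosen for illustration, and the right candidates depend on the pages you actually retrieve:

```python
def decode_with_fallback(raw, candidates=("utf-8", "windows-1252")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding; try the next
    raise ValueError("none of the candidate encodings could decode the data")


# 0xAE on its own is not valid UTF-8, so the windows-1252 candidate is used.
text, used = decode_with_fallback(b"\xae")
print(used)
```

Strict UTF-8 is listed first because it rejects most input that is not really UTF-8, while a single-byte encoding such as windows-1252 accepts almost any byte sequence and would otherwise mask the difference.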
