Question

I am using urlfetch to fetch a URL. When I try to pass the result to the html2text function (which strips all HTML tags), I get the following message:

UnicodeEncodeError: 'charmap' codec can't encode characters in position  ... character maps to <undefined>

I have been trying to deal with it by calling encode('UTF-8', 'ignore') on the string, but I keep getting this error.

Any ideas?

Thanks,

Joel

Some code:

result = urlfetch.fetch(url="http://www.google.com")
html2text(result.content.encode('utf-8', 'ignore'))

Error message:

File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 159-165: character maps to <undefined>

Answer 1

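You need to decode the data you fetched first! With which codec? That depends on the website you fetch.

Once you have unicode and encode it with some_unicode.encode('utf-8', 'ignore'), I can't imagine how that could raise an error.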
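To illustrate the point, here is a minimal sketch. The helper name `try_encode` and the sample text are made up for this example. It shows that encoding unicode text to UTF-8 succeeds, while encoding the same text to cp1252 (the codec named in the traceback) fails with the 'charmap' error unless the errors are ignored. That suggests the failure in the question probably comes from an implicit cp1252 encode somewhere else, not from the explicit UTF-8 one.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: unicode text containing characters outside cp1252.
text = u"caf\u00e9 \u4e2d\u6587"


def try_encode(value, codec, errors="strict"):
    """Return the encoded bytes, or a short error description on failure."""
    try:
        return value.encode(codec, errors)
    except UnicodeEncodeError as exc:
        return "%s: %s" % (type(exc).__name__, exc.reason)


utf8_result = try_encode(text, "utf-8")               # succeeds
cp1252_result = try_encode(text, "cp1252")            # fails: 'charmap' codec
cp1252_lossy = try_encode(text, "cp1252", "ignore")   # drops unmappable chars
```

With "ignore", the characters cp1252 cannot represent are silently dropped, so this only hides the problem. Decoding the fetched bytes properly first is the real fix.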

OK, here is what you need to do:

result = fetch('http://google.com')
content_type = result.headers['Content-Type']  # figure out what you just fetched
ctype, charset = content_type.split(';')
encoding = charset[len(' charset='):]  # get the encoding
print encoding  # e.g. ISO-8859-1
utext = result.content.decode(encoding)  # now you have unicode
text = utext.encode('utf8', 'ignore')  # encode to UTF-8

This is not very robust, but it should show you the way.
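A somewhat more defensive version of the same idea might look like the sketch below. The function names `charset_from_content_type` and `decode_body`, and the UTF-8 fallback, are assumptions made for this example and are not part of urlfetch. It copes with a missing header, a header without a charset parameter, extra parameters, and a charset name that Python does not recognize.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` (an assumed choice) when no charset is given.
    """
    # Skip the media type itself and look only at the ';'-separated parameters.
    for param in (content_type or "").split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip("\"'")
    return default


def decode_body(body, content_type, default="utf-8"):
    """Decode raw response bytes to unicode using the declared charset."""
    encoding = charset_from_content_type(content_type, default)
    try:
        return body.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or mislabeled content: decode leniently instead.
        return body.decode(default, "replace")
```

In the snippet above, this would be used as something like decode_body(result.content, result.headers.get('Content-Type')), assuming the headers object supports get. The resulting unicode string can then be passed on to html2text.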

UnicodeEncodeError when fetching a URL
