I'm trying to write the HTML from a web page to a file, but I'm having trouble decoding the characters:
import urllib.request
response = urllib.request.urlopen("https://www.google.com")
charset = response.info().get_content_charset()
print(response.read().decode(charset))
The last line raises an error:
Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in position 6079: ordinal not in range(128)
response.info().get_content_charset() returns iso-8859-2, but if I inspect the response body without decoding it (print(response.read())), the HTML meta tag declares "utf-8" as the encoding. Passing "utf-8" to decode fails in a similar way:
Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 6111: invalid start byte
What is going on?
Answer 0 (score: 0)
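You can ignore invalid characters with response.read().decode("utf-8", 'ignore'). Note that ignore silently drops the bytes it cannot decode. Besides ignore, there are other options, such as replace, which substitutes the replacement character U+FFFD for each undecodable sequence.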
https://www.tutorialspoint.com/python/string_encode.htm
https://docs.python.org/3/howto/unicode.html#the-string-type
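To make the difference between the error handlers concrete, here is a minimal sketch. The byte string raw is a made-up sample (not taken from the actual response): it is valid ISO-8859-2 but not valid UTF-8, and it contains the same byte 0xb6 that appears in the second traceback.

```python
# Hypothetical sample: "cześć" encoded as ISO-8859-2.
# 0xb6 is 'ś' (U+015B) and 0xe6 is 'ć' (U+0107) in that encoding.
raw = b"cze\xb6\xe6"

# The default handler, 'strict', raises on the first undecodable byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict failed:", exc.reason)

# 'ignore' silently drops the bytes that cannot be decoded.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD for each undecodable sequence.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the charset the bytes were actually written in loses nothing.
correct = raw.decode("iso-8859-2")

# ascii() keeps the output printable even on an ASCII-only console.
print(ascii(ignored))
print(ascii(replaced))
print(ascii(correct))
```

Both ignore and replace lose information, so they are best treated as a fallback for when the real encoding of the bytes cannot be determined.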
(Strings also have str.encode(encoding='UTF-8', errors='strict').)
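The first traceback in the question is a UnicodeEncodeError, which is raised when encoding a string for output rather than when decoding bytes; a likely cause is that print is writing to a console whose encoding is ASCII. The sketch below shows how the errors argument of str.encode behaves, and how writing to a file with an explicit encoding avoids depending on the console. The sample text and the output path are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical sample containing '\u015b', the character named in the first traceback.
text = "cze\u015b\u0107"

# With the default errors='strict', encoding to ASCII fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict failed:", exc.reason)

# Other handlers substitute something that ASCII can represent.
as_question_marks = text.encode("ascii", errors="replace")
as_escapes = text.encode("ascii", errors="backslashreplace")
print(as_question_marks)
print(as_escapes)

# Writing to a file with an explicit encoding does not depend on the console.
# The path is a throwaway location chosen for this example.
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading it back with the same encoding returns the original string.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

Since the goal in the question is to write the HTML to a file, opening the file with encoding="utf-8" (or writing the raw bytes in binary mode) is probably a better fit than printing the decoded text.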