Question

I am trying to write the HTML from a web page to a file, but I am having trouble decoding the characters:

import urllib.request

response = urllib.request.urlopen("https://www.google.com")

charset = response.info().get_content_charset()
print(response.read().decode(charset))

The last line raises an error:

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in position 6079: ordinal not in range(128)

response.info().get_content_charset() returns iso-8859-2, but if I inspect the response body without decoding it (print(response.read())), the HTML meta tag declares "utf-8" as the encoding. Passing "utf-8" to decode gives a similar problem:

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 6111: invalid start byte

What is going on?

Answer 1

You can ignore invalid characters with

response.read().decode("utf-8", 'ignore')

Besides ignore, there are other error-handling options, for example replace.
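As a rough illustration of how these error handlers differ, here is a minimal sketch. The sample bytes are made up for the example (they are not taken from the actual page); they only reuse the byte 0xb6 that the UnicodeDecodeError in the question complains about.

```python
# Hypothetical sample: the word "świat" encoded as ISO-8859-2.
# 0xb6 is the byte named in the UnicodeDecodeError from the question.
raw = b"\xb6wiat"

# The default handler, "strict", raises on bytes that are not valid UTF-8.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", "ignore")    # the invalid byte is silently dropped
replaced = raw.decode("utf-8", "replace")  # the invalid byte becomes U+FFFD
declared = raw.decode("iso-8859-2")        # decoding with the charset from the header

print(strict_failed, repr(ignored), repr(replaced), repr(declared))
```

Note that in ISO-8859-2 the byte 0xb6 maps to '\u015b', the same character the first traceback mentions, so ignore and replace both lose that character while decoding with the reported charset keeps it.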

https://www.tutorialspoint.com/python/string_encode.htm

https://docs.python.org/3/howto/unicode.html#the-string-type

(Strings also have the reverse operation, str.encode(encoding='UTF-8', errors='strict').)
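Since the goal in the question is to write the HTML to a file, the sketch below shows str.encode with different error handlers and a file write with an explicit encoding. It assumes, without confirmation from the question, that the UnicodeEncodeError came from print() encoding the text for an ASCII console; the sample text and the file name page.html are placeholders.

```python
import os
import tempfile

# Placeholder text containing '\u015b', the character from the first traceback.
text = "\u015bwiat"

# Strict encoding to ASCII fails; this is presumably what print() ran into
# if the console encoding was ASCII.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

ascii_replaced = text.encode("ascii", errors="replace")  # unencodable char becomes "?"
utf8_bytes = text.encode("utf-8")                         # UTF-8 can represent it

# Writing with an explicit encoding does not depend on the console encoding.
path = os.path.join(tempfile.mkdtemp(), "page.html")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the raw bytes back to confirm what ended up on disk.
with open(path, "rb") as f:
    written = f.read()
```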
