Python - Unable to decode HTML (urllib)

Asked: 2018-01-29 16:58:48

Tags: python html python-3.x character-encoding urllib

I am trying to write the HTML from a web page to a file, but I am having trouble decoding the characters:

import urllib.request

response = urllib.request.urlopen("https://www.google.com")

charset = response.info().get_content_charset()
print(response.read().decode(charset))

The last line raises an error:

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode(charset))
UnicodeEncodeError: 'ascii' codec can't encode character '\u015b' in position 6079: ordinal not in range(128)

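response.info().get_content_charset() returns iso-8859-2, but if I inspect the response content without decoding it (print(response.read())), the HTML meta tag declares "utf-8" as the encoding. If I pass "utf-8" to decode instead, I get a similar problem: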

Traceback (most recent call last):
  File "script.py", line 7, in <module>
    print(response.read().decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 6111: invalid start byte

What is going on?

1 Answer:

Answer 0 (score: 0)

You can ignore the invalid characters with:

response.read().decode("utf-8", 'ignore')

Besides ignore, there are other error handlers, such as replace.
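To make the difference between the error handlers concrete, here is a minimal sketch. It does not fetch anything: the sample bytes (the word "śnieg" encoded as ISO-8859-2) are an assumption chosen only because they contain the same 0xb6 byte that appears in the question's traceback.

```python
# Hypothetical sample data: "śnieg" encoded as ISO-8859-2.
# In that encoding, the character U+015B is the single byte 0xb6,
# which is not a valid start byte in UTF-8.
raw = "\u015bnieg".encode("iso-8859-2")

# 'ignore' silently drops the byte that cannot be decoded.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD (the replacement character) for it.
replaced = raw.decode("utf-8", "replace")

# Decoding with the charset the bytes were really written in loses nothing.
correct = raw.decode("iso-8859-2")

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

Note that both ignore and replace lose information, so they are probably best treated as a fallback when the real encoding cannot be determined.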

https://www.tutorialspoint.com/python/string_encode.htm

https://docs.python.org/3/howto/unicode.html#the-string-type

(Strings also have str.encode(encoding='UTF-8', errors='strict').)
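The errors parameter of str.encode works the same way in the other direction. The sketch below is only an illustration under assumptions: the sample text and the file name page.html are made up, and the file is written to a temporary directory. It shows the strict default failing for ASCII, as in the first traceback of the question, two alternative handlers, and a write to a file opened with an explicit UTF-8 encoding.

```python
import os
import tempfile

text = "\u015bnieg"  # hypothetical sample containing U+015B

# With the default errors='strict', encoding to ASCII raises UnicodeEncodeError.
try:
    text.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = str(exc)

# 'replace' emits '?' for each character that cannot be encoded.
replaced = text.encode("ascii", errors="replace")

# 'backslashreplace' emits a Python-style escape sequence instead.
escaped = text.encode("ascii", errors="backslashreplace")

# Writing to a file opened with an explicit encoding avoids depending on
# whatever default encoding the terminal or locale provides.
path = os.path.join(tempfile.mkdtemp(), "page.html")  # hypothetical file name
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(strict_error)
print(replaced, escaped, round_trip == text)
```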