For the following lines that use urllib:
# some request object exists
response = urllib.request.urlopen(request)
html = response.read().decode("utf8")
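What is the format of the string returned by read()? I have been trying to find this in Python's documentation, but it is not mentioned at all. And why is there a decode? Does decode convert the object to UTF-8 or from UTF-8? From what format, and into what format? The decode documentation does not mention this either. Is Python's documentation really that bad, or is there some standard convention I am not understanding?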
I want to store that HTML in a UTF-8 file. Can I just do a regular write, or do I need to "encode" it back into something and write that out?
Note: I know urllib is deprecated, but I can't switch to urllib2 right now.
Answer 0 (score: 1)
Ask Python:
>>> import urllib
>>> r=urllib.urlopen("http://google.com")
>>> a=r.read()
>>> type(a)
0: <type 'str'>
>>> help(a.decode)
Help on built-in function decode:

decode(...)
    S.decode([encoding[,errors]]) -> object

    Decodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
    as well as any other name registered with codecs.register_error that is
    able to handle UnicodeDecodeErrors.
>>> b = a.decode('utf8')
>>> type(b)
1: <type 'unicode'>
>>>
So read() appears to return a str, which in Python 2 is a byte string. .decode('utf8') then decodes those bytes from UTF-8 into Python's internal unicode type.
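The session above is Python 2. The question's code uses urllib.request, which is Python 3, where read() returns bytes and decode() returns str. Here is a minimal sketch of that case, including the write to a UTF-8 file. It assumes the page really is UTF-8; the sample markup and file names are made up for illustration, and a literal byte string stands in for response.read() so the example runs without a network connection.

```python
import os
import tempfile

# Stand-in for the bytes that response.read() would return in Python 3.
# The markup is hypothetical sample data.
raw = "<html><body>café</body></html>".encode("utf-8")

# decode() goes FROM UTF-8 bytes TO a text string (str in Python 3).
html = raw.decode("utf-8")

# Option 1: write the text and let open() encode it as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# Option 2: skip decoding entirely and write the original bytes unchanged.
raw_path = path + ".raw"
with open(raw_path, "wb") as f:
    f.write(raw)

# Compare the two files byte for byte.
with open(path, "rb") as f1, open(raw_path, "rb") as f2:
    same = f1.read() == f2.read()
```

If the text is never inspected or modified, option 2 avoids the decode/encode round trip. Option 1 is the one to use when the HTML is processed as text before saving.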