How to [...] in Python

Time: 2017-08-23 17:08:49

Tags: python python-2.7 encoding urllib2 windows-1255

I am trying to fetch and parse a web page that contains non-ASCII characters (the URL is http://www.one.co.il). This is what I have:

import urllib2

url = "http://www.one.co.il"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
encoding = response.headers.getparam('charset') # windows-1255
html = response.read() # The length of this is valid - about 31000-32000,
                       # but printing the first characters shows garbage -
                       # '\x1f\x8b\x08\x00\x00\x00\x00\x00', instead of
                       # '<!DOCTYPE'
html_decoded = html.decode(encoding)

The last line raises an exception:

File "C:/Users/....\WebGetter.py", line 16, in get_page
  html_decoded = html.decode(encoding)
File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
  return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0xdb in position 14: character maps to <undefined>

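I looked at other related questions, such as "urllib2 read to Unicode" and "How to handle response encoding from urllib.request.urlopen()", but did not find anything that helped.

Can someone shed some light on this and point me in the right direction? Thanks!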

1 Answer:

Answer 0: (score: 1)

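0x1f 0x8b 0x08 is the signature of a gzip-compressed stream: the two magic bytes 0x1f 0x8b, followed by 0x08, which identifies the deflate compression method. The response body is compressed, so you need to decompress it before decoding it as windows-1255.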
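
As a rough illustration of that advice, here is a minimal sketch that gunzips the body when it starts with the gzip magic bytes and only then decodes it. The helper name decode_body and the constant GZIP_MAGIC are made up for this example and are not part of urllib2. It uses only the standard gzip and io modules, so it should behave the same on Python 2.7 and Python 3.

```python
import gzip
import io

# First two bytes of every gzip stream (hypothetical constant name).
GZIP_MAGIC = b'\x1f\x8b'


def decode_body(raw, encoding):
    """Return the body as text, gunzipping it first if it looks compressed.

    raw      -- the bytes returned by response.read()
    encoding -- the charset from the response headers, e.g. 'windows-1255'
    """
    if raw[:2] == GZIP_MAGIC:
        # Wrap the bytes in a file-like object so GzipFile can read them.
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw.decode(encoding)


# In the question's code, the last two steps would then become:
#   html_decoded = decode_body(response.read(), encoding)
```

Sniffing the magic bytes is only one way to detect compression. Checking whether the response's Content-Encoding header is "gzip" is the more conventional test, and it may be the better choice if the server sets that header reliably.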