Garbled response when crawling with urllib2 (Python 2.7)

Time: 2014-12-09 03:49:05

Tags: python python-2.7 urllib2 urllib

I am using urllib2, but the response looks like this:

'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\xed\x9d{s\xe2\xb8\x9a\x87\xff\xeeSu\xbe\x83\x0e\xc5\xce\xf4L\x11\xe2\x0b\xd7\x99\xee\xec*`\xc0\x1d_86\x84N\xb6\xb6\xa6\x1cp\x82\xa7\tfl\x93t\xce\xa7_\xc9@ \xc4\x18\xc5\t\xf1\xa8\xad\xae\x99t\xdb\xb1\xe5\xd7\x92~\xef\xa3W7\x7f\xfaWSo\xf4.\xba\x12\x18\x07\xb7\x13\xd0\xed\x9f*r\x03\xe4\x8e\x8e\x8f\x07b\xe3\xf8\xb8\xd9k\x82\xaf\x9d\x9e\xaa\x00\xbe\xc8\x81\x9egM}\'p\xdc\xa959>\x96\xb4\x1c\xc8\x8d\ 

My code is:

import urllib
import urllib2

url = "http://fsr.merckresponsibility.com/fsr/service.do?"
params = {"page": 2, "sort": "name", "descending": "asc", "letter": "all", "keytype": "", "keywords": "", "rows": 80}
params = urllib.urlencode(params)
header = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
          "Accept-Encoding": "gzip, deflate, sdch",
          "Accept-Language": "en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4",
          "Connection": "keep-alive",
          "Cache-Control": "max-age=0",
          "Host": "fsr.merckresponsibility.com",
          "Cookie": "JSESSIONID=5D0AB9801BC9B522B043FC10C1705AF1.st3024;unique_visitor=60.254.142.39.1418022044678466; BIGipServerDMZ-04-Shared-HTTP=2926383",
          "Referer": "http://fsr.merckresponsibility.com/fsr/service.do",
          "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)Chrome/39.0.2171.71 Safari/537.36"}

req = urllib2.Request(url + params, headers=header)
response = urllib2.urlopen(req).read()

Can anyone tell me what went wrong?

1 answer:

Answer 0: (score: 0)

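The code passes the "Accept-Encoding": "gzip, deflate, sdch" header, which causes the server to encode the content with gzip. The response is therefore not garbled but compressed: its leading bytes \x1f\x8b are the gzip magic number.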
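A quick way to confirm this diagnosis is to look at the first two bytes of the body. The following is a minimal sketch; `looks_gzipped` is a hypothetical helper name, and the sample bytes are copied from the start of the response shown in the question.

```python
# gzip streams always begin with these two bytes
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(body):
    """Return True if the raw bytes start with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


# First bytes of the response from the question
sample = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00"

print(looks_gzipped(sample))     # compressed body
print(looks_gzipped(b"<html>"))  # plain HTML body
```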

Removing that header will solve your problem.
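As a rough sketch, the header can be dropped from the dict before the request is built. The dict below is a trimmed copy of the one in the question (only two entries are kept for brevity); the rest of the question's code would stay unchanged. Without an explicit value, urllib2 (through httplib) should fall back to asking for an uncompressed, "identity" body.

```python
# Trimmed copy of the header dict from the question (illustration only)
header = {
    "Accept-Encoding": "gzip, deflate, sdch",
    "Host": "fsr.merckresponsibility.com",
}

# Drop the header so the server is no longer asked for a compressed body
header.pop("Accept-Encoding", None)

print(sorted(header))
```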


If you do want to use gzip encoding, you need to decompress the response with the gzip module:

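...
response = urllib2.urlopen(req).read()

import gzip
import StringIO
# Wrap the compressed bytes in a file-like object so GzipFile can read them
f = StringIO.StringIO(response)
zf = gzip.GzipFile(fileobj=f)
# Replace the compressed bytes with the decompressed page content
response = zf.read()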
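Since a server is free to ignore Accept-Encoding and send plain content, it can be safer to decompress only when the response says it is compressed. The sketch below is one possible way to do that; `decode_body` is a hypothetical helper, not part of urllib2, and it uses `io.BytesIO` instead of `StringIO` so that it runs on both Python 2.7 and Python 3. It does not handle sdch, so a request that relies on it should probably advertise only "gzip, deflate".

```python
import gzip
import io
import zlib


def decode_body(raw, content_encoding):
    """Decompress raw bytes according to a Content-Encoding header value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    if encoding == "deflate":
        try:
            # Standard zlib-wrapped deflate stream
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No (or unknown) encoding: return the bytes untouched
    return raw


# Offline demonstration: compress a small page, then decode it again
html = b"<html>hello</html>"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as zf:
    zf.write(html)
compressed = buf.getvalue()

print(decode_body(compressed, "gzip") == html)
print(decode_body(html, None) == html)
```

With the code from the question, the call would look roughly like `resp = urllib2.urlopen(req)` followed by `decode_body(resp.read(), resp.info().get("Content-Encoding"))`.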