urllib2.urlopen fails on the same URL that urllib.urlopen handles fine

Time: 2013-11-28 19:53:27

Tags: python http python-2.7 urllib2 urllib

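I am using both urllib and urllib2 to scrape some data from a particular website.

At the moment the urllib part of the code mainly reads and processes the data, while the urllib2 part mainly reads and stores it.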

The external site went through some changes. The urllib part of the code kept working as normal, but the urllib2 part simply started failing.

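So I ran some checks and found that urllib2.urlopen(URL) always returns an empty string, while urllib.urlopen(URL) always works fine.

I dug deeper and enabled debug logging on both the urllib and urllib2 modules: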

>>> response2 = urllib2.urlopen('http://www.xxxxxxxxltd.com/web/guest/attendancelist')
send: 'GET /web/guest/attendancelist HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxltd.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
header: Server: nginx/0.7.67
header: Date: Thu, 28 Nov 2013 19:12:28 GMT
header: Transfer-Encoding: chunked
header: Connection: close
header: Location: http://www.xxxxxxxxplc.com/web/guest/attendancelist
send: 'GET /web/guest/attendancelist HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxplc.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Server: Apache-Coyote/1.1
header: Location: /home/new/attendancelist.jsp
header: Content-Length: 0
header: Date: Thu, 28 Nov 2013 19:12:26 GMT
header: Connection: close
send: 'GET /home/new/attendancelist.jsp HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxplc.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Apache-Coyote/1.1
header: Set-Cookie: JSESSIONID=F02B1F76CCCF6F41BE48951F6E1A6205; Path=/home
header: Content-Type: text/html;charset=utf-8
header: Content-Length: 0
header: Date: Thu, 28 Nov 2013 19:12:26 GMT
header: Connection: close

and...

>>> html3 = urllib.urlopen('http://www.xxxxxxxxltd.com/web/guest/attendancelist')
send: 'GET /web/guest/attendancelist HTTP/1.0\r\nHost: www.xxxxxxxxltd.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
header: Server: nginx/0.7.67
header: Date: Thu, 28 Nov 2013 19:10:36 GMT
header: Connection: close
header: Location: http://www.xxxxxxxxplc.com/web/guest/attendancelist
send: 'GET /web/guest/attendancelist HTTP/1.0\r\nHost: www.xxxxxxxxplc.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Server: Apache-Coyote/1.1
header: Location: /home/new/attendancelist.jsp
header: Content-Length: 0
header: Date: Thu, 28 Nov 2013 19:10:34 GMT
header: Connection: close
send: 'GET /home/new/attendancelist.jsp HTTP/1.0\r\nHost: www.xxxxxxxxplc.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Apache-Coyote/1.1
header: Set-Cookie: JSESSIONID=8CFB903B80C42CA3DA37EDF90D84FF99; Path=/home
header: Content-Type: text/html;charset=utf-8
header: Date: Thu, 28 Nov 2013 19:10:35 GMT
header: Connection: close
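
Traces like the two above (the send / reply / header lines) come from the debug level of the underlying HTTP connection. The following is only a minimal sketch of one way to switch that on for urllib2; `make_debug_opener` is a hypothetical helper name, and the import fallback is there so the same lines also run on Python 3, where the module is called urllib.request. For the older urllib module, a commonly used approach is setting `httplib.HTTPConnection.debuglevel = 1`, though whether the trace in the question was produced that way is an assumption.

```python
try:
    import urllib2                    # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 name for the same API


def make_debug_opener():
    """Return an opener whose HTTP handler prints send/reply/header lines."""
    # debuglevel=1 is passed down to the HTTP connection used for each request.
    return urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))


# Usage (performs a real network request, so it is left commented out):
# opener = make_debug_opener()
# body = opener.open('http://www.xxxxxxxxltd.com/web/guest/attendancelist').read()
```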

As the traces show, the urllib2 request sends a few more headers (one of them being the Connection header, with the value close).
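
To make that comparison concrete, here is a small sketch that parses the first `send:` line of each trace and lists what differs. The two request strings are copied from the traces; `parse_request` is a hypothetical helper written for this illustration. Besides the extra headers, it also surfaces that the two requests use different HTTP versions and User-Agent values, any of which might matter to the server.

```python
def parse_request(raw):
    """Split a raw HTTP request into (request_line, {header_name: value})."""
    lines = raw.split('\r\n')
    headers = {}
    for line in lines[1:]:
        if line:  # skip the empty strings left by the trailing blank line
            name, value = line.split(': ', 1)
            headers[name] = value
    return lines[0], headers


# First request of each trace, copied from the debug output.
URLLIB2_SEND = ('GET /web/guest/attendancelist HTTP/1.1\r\n'
                'Accept-Encoding: identity\r\nHost: www.xxxxxxxxltd.com\r\n'
                'Connection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n')
URLLIB_SEND = ('GET /web/guest/attendancelist HTTP/1.0\r\n'
               'Host: www.xxxxxxxxltd.com\r\n'
               'User-Agent: Python-urllib/1.17\r\n\r\n')

line2, headers2 = parse_request(URLLIB2_SEND)
line1, headers1 = parse_request(URLLIB_SEND)

# Headers that only urllib2 sends.
only_in_urllib2 = sorted(set(headers2) - set(headers1))
# Headers both send, but with different values.
changed = sorted(k for k in headers1 if k in headers2 and headers1[k] != headers2[k])

print(only_in_urllib2)                             # ['Accept-Encoding', 'Connection']
print(changed)                                     # ['User-Agent']
print(line2.split(' ')[-1], line1.split(' ')[-1])  # the two HTTP versions
```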

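Can anyone help me find out why urllib2 fails to retrieve the data while the urllib module works fine?

I am fairly sure it has something to do with the Connection header, but I would like some kind of confirmation and an explanation of the reasoning.

Thanks.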

1 Answer:

Answer 0 (score: 0)

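I suggest debugging with curl by replicating the headers sent by the two versions of urllib. With some trial and error you should be able to find the header that causes the problem and go from there.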
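
If staying in Python is more convenient than curl, the same trial and error can be sketched with the low-level HTTP client, which lets you choose the headers yourself. This is only an illustration under a few assumptions: `fetch_with_headers` and `compare` are hypothetical helpers, the header lists are copied from the traces in the question, and the helper does not follow redirects, so it should be pointed at the final URL from the trace. It also always speaks HTTP/1.1, so it cannot reproduce the HTTP/1.0 request that urllib sent; curl's `--http1.0` option covers that case.

```python
try:
    import httplib                 # Python 2, as used in the question
except ImportError:
    import http.client as httplib  # Python 3 name for the same module


# Hypothetical helper, not part of urllib or urllib2.
def fetch_with_headers(host, path, headers, connection_class=httplib.HTTPConnection):
    """Send one GET with exactly the given headers; return (status, body)."""
    conn = connection_class(host)
    try:
        # skip_host / skip_accept_encoding stop the library adding its own headers.
        conn.putrequest('GET', path, skip_host=True, skip_accept_encoding=True)
        conn.putheader('Host', host)
        for name, value in headers:
            conn.putheader(name, value)
        conn.endheaders()
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()


# Header sets copied from the two traces in the question.
URLLIB2_HEADERS = [('Accept-Encoding', 'identity'),
                   ('Connection', 'close'),
                   ('User-Agent', 'Python-urllib/2.6')]
URLLIB_HEADERS = [('User-Agent', 'Python-urllib/1.17')]


def compare(host, path, connection_class=httplib.HTTPConnection):
    """Fetch the same path with each header set; return {label: (status, body_length)}."""
    results = {}
    for label, headers in (('urllib2', URLLIB2_HEADERS), ('urllib', URLLIB_HEADERS)):
        status, body = fetch_with_headers(host, path, headers, connection_class)
        results[label] = (status, len(body))
    return results


# Usage (performs real network requests, so it is left commented out):
# print(compare('www.xxxxxxxxplc.com', '/home/new/attendancelist.jsp'))
```

If both header sets give the same body length here, the headers alone are probably not the cause, and the HTTP version or the session cookie would be the next things to vary. Removing one header at a time from `URLLIB2_HEADERS` narrows it down further.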