Fetching a URL with a Host header set raises the exception "Exceeded 30 redirects"
This is very strange, and I can't figure it out.
Here is the test code:
>>> url = 'http://bbs.duchang8.com/forum-29-1.html'
>>> r = requests.get(url)
>>> print r.status_code
200
>>> headers = {
... 'Host': 'bbs.duchang8.com',
... }
>>> r = requests.get(url, headers=headers)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/www/article_fetcher/venv/local/lib/python2.7/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/data/www/article_fetcher/venv/local/lib/python2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/data/www/article_fetcher/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/data/www/article_fetcher/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 594, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/data/www/article_fetcher/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 114, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
Answer 0 (score: 3)
Short answer:
Don't override the Host: header.
Or, override it with the host that the client is redirected to.
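One way to picture the second option is to follow the redirects by hand and recompute the Host: value for every hop. The sketch below is only an illustration under assumptions: `host_for`, `follow_with_host`, and `requests_fetch` are hypothetical helper names, not part of requests, and `fetch` is assumed to be a callable that performs a single request without following redirects.

```python
from urllib.parse import urljoin, urlsplit


def host_for(url):
    """Return the Host header value that matches the given URL."""
    parts = urlsplit(url)
    if parts.port is None:
        return parts.hostname
    return "%s:%d" % (parts.hostname, parts.port)


def follow_with_host(url, fetch, max_redirects=30):
    """Follow redirects manually, recomputing Host for every hop.

    `fetch(url, headers)` is a hypothetical callable that returns a
    (status_code, location_or_None) tuple and does NOT follow redirects.
    """
    for _ in range(max_redirects + 1):
        status, location = fetch(url, {"Host": host_for(url)})
        if status not in (301, 302, 303, 307, 308) or not location:
            return url, status
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("Exceeded %d redirects." % max_redirects)


def requests_fetch(url, headers):
    """One possible `fetch`, built on requests (imported lazily)."""
    import requests
    resp = requests.get(url, headers=headers, allow_redirects=False)
    return resp.status_code, resp.headers.get("location")
```

Calling `follow_with_host(url, requests_fetch)` would then send a Host: header that always matches the URL being fetched. In most cases simply leaving the header unset achieves the same thing with less code.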
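Long answer:
By explicitly setting the Host header, you are telling requests to use that header in all subsequent requests, including any that are reissued in response to a redirect from the server.
In this case the requests client is redirected to http://www.duchang8.com/forum-29-1.html, a location served under a different host name: www.duchang8.com, not bbs.duchang8.com. Although both host names resolve to the same IP address, the remote HTTP server treats them differently.
The net result is that requests keeps sending the Host: header that you supplied rather than the correct one for the location the server returned. The follow-up request to the new location is then rejected (by way of another redirect) because the host in the URL does not match the Host: header.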
>>> import requests
>>> url = 'http://bbs.duchang8.com/forum-29-1.html'
>>> r = requests.get(url)
>>> r
<Response [200]>
>>> r.history
[<Response [301]>]
>>> r.history[0].headers
{'content-length': '178', 'server': 'nginx', 'connection': 'keep-alive', 'location': 'http://www.duchang8.com/forum-29-1.html', 'date': 'Mon, 03 Aug 2015 12:20:31 GMT', 'content-type': 'text/html'}
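Here we see that the client was redirected by an HTTP 301 response whose location: header points to http://www.duchang8.com/forum-29-1.html.
Using curl, you can see what happens if you try to supply a different Host: header when fetching the new location: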
$ curl -v -L -H 'Host: bbs.duchang8.com' http://www.duchang8.com/forum-29-1.html
* Trying 61.160.249.39...
* Connected to www.duchang8.com (61.160.249.39) port 80 (#0)
> GET /forum-29-1.html HTTP/1.1
> User-Agent: curl/7.40.0
> Accept: */*
> Host: bbs.duchang8.com
>
< HTTP/1.1 301 Moved Permanently
< Server: nginx
< Date: Mon, 03 Aug 2015 12:27:33 GMT
< Content-Type: text/html
< Content-Length: 178
< Connection: keep-alive
< Location: http://www.duchang8.com/forum-29-1.html
<
* Ignoring the response-body
* Connection #0 to host www.duchang8.com left intact
* Issue another request to this URL: 'http://www.duchang8.com/forum-29-1.html'
* Found bundle for host www.duchang8.com: 0x21b54c0
* Re-using existing connection! (#0) with host www.duchang8.com
* Connected to www.duchang8.com (61.160.249.39) port 80 (#0)
> GET /forum-29-1.html HTTP/1.1
> User-Agent: curl/7.40.0
> Accept: */*
> Host: bbs.duchang8.com
>
< HTTP/1.1 301 Moved Permanently
< Server: nginx
< Date: Mon, 03 Aug 2015 12:27:33 GMT
< Content-Type: text/html
< Content-Length: 178
< Connection: keep-alive
< Location: http://www.duchang8.com/forum-29-1.html
<
# and so on, and so on....
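It ends in a redirect loop. The same sequence of requests and responses happens with requests, which eventually decides that it is never going to end and aborts the request.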
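The loop can be reproduced offline with a toy model. This is only a sketch: `toy_server` encodes a redirect rule inferred from the curl trace above (an assumption about the real server's configuration), `get` is a simplified stand-in for a redirect-following client, and the local `TooManyRedirects` class merely mimics the requests exception of the same name.

```python
from urllib.parse import urlsplit


class TooManyRedirects(Exception):
    """Stand-in for requests.exceptions.TooManyRedirects."""


def toy_server(host_header):
    # Assumed rule, inferred from the curl trace: any request whose Host
    # is the bbs name is redirected to www; everything else is served.
    if host_header == "bbs.duchang8.com":
        return 301, "http://www.duchang8.com/forum-29-1.html"
    return 200, None


def get(url, forced_host=None, max_redirects=30):
    """Follow redirects; a forced Host is re-sent unchanged on every hop."""
    hops = 0
    while True:
        host = forced_host or urlsplit(url).hostname
        status, location = toy_server(host)
        if status != 301:
            return status, hops
        if hops >= max_redirects:
            raise TooManyRedirects("Exceeded %d redirects." % max_redirects)
        url = location
        hops += 1
```

Without `forced_host` the model finishes after a single redirect, matching the one-entry `r.history` shown earlier. With `forced_host='bbs.duchang8.com'` every hop gets the same 301 until the limit is reached.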