TooManyRedirects when using Google's IP instead of the domain name

Time: 2019-08-31 13:03:03

Tags: python http python-requests http-headers web-crawler

I'm trying to scrape Google search results. When I use the domain name like this, everything works fine:

import requests
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
requests.get('https://google.com/search?q={}'.format('movie'),
    verify=False, headers={'User-Agent': user_agent})

But when I scrape Google by IP address instead:

requests.get('https://216.58.207.78/search?q={}'.format('movie'),
    verify=False, headers={'User-Agent': user_agent, 'host': 'google.com'})

I get the following error:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 668, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 668, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 165, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

How can I fix this?

1 answer:

Answer 0: (score: 2)

Fix this by adding www. to the value of your Host header:

requests.get('https://216.58.207.78/search?q={}'.format('movie'),
    verify=False, headers={'User-Agent': user_agent, 'host': 'www.google.com'})

Explanation

This happens because you are sending google.com in the Host HTTP header.

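When Google receives your request, it sees from the Host header that you are asking for google.com, so it redirects you to www.google.com. However, when requests follows the redirect, it resends the headers from your original request, so Host is still google.com. The server therefore redirects you again, and so on, until requests gives up after its limit of 30 redirects.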
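The loop described above can be sketched without touching the network. This is only a toy model, not how Google or requests are actually implemented: `server_response` and `follow_redirects` are hypothetical names, the redirect rule is assumed from the behaviour described in this answer, and the limit of 30 is taken from the traceback in the question.

```python
def server_response(host_header):
    """Hypothetical stand-in for the server: it redirects the bare
    domain to the www. host and serves everything else directly."""
    if host_header == 'google.com':
        return 301, 'https://www.google.com/search?q=movie'
    return 200, None


def follow_redirects(host_header, max_redirects=30):
    """Mimic a client that resends the same pinned Host header on
    every hop. Returns the number of redirects that were followed."""
    hops = 0
    while True:
        status, location = server_response(host_header)
        if status == 200:
            return hops
        hops += 1
        if hops > max_redirects:
            # The pinned header never changes, so the server keeps redirecting.
            raise RuntimeError('Exceeded %s redirects.' % max_redirects)


# With the www. host the toy server answers on the first request.
print(follow_redirects('www.google.com'))

# With the bare domain every hop gets the same redirect again.
try:
    follow_redirects('google.com')
except RuntimeError as exc:
    print(exc)
```

In this model the first call finishes with no redirects, while the second never reaches a 200 response because the Host value that triggers the redirect is sent again on each hop.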

You can also just remove the Host header; as far as I can tell, it makes no difference.