I'm trying to scrape Google search results. Everything works fine when I use the domain name, like this:
import requests
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
requests.get('https://google.com/search?q={}'.format('movie'),\
verify=False, headers={'User-Agent': user_agent})
But when I request Google by IP address instead:
requests.get('https://216.58.207.78/search?q={}'.format('movie'),\
verify=False, headers={'User-Agent': user_agent, 'host': 'google.com'})
I get the following error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 668, in send
history = [resp for resp in gen] if allow_redirects else []
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 668, in <listcomp>
history = [resp for resp in gen] if allow_redirects else []
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 165, in resolve_redirects
raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
How can I fix this?
Answer 0 (score: 2)
Fix this by adding www. to the value of your Host header:
requests.get('https://216.58.207.78/search?q={}'.format('movie'),\
verify=False, headers={'User-Agent': user_agent, 'host': 'www.google.com'})
Explanation:
This happens because you sent google.com in the Host HTTP header.
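When Google receives your request, it sees from the Host header that you are asking for google.com, so it redirects you to www.google.com. But when requests follows the redirect, it resends the headers from your original request, including Host: google.com. So the server redirects you again, and so on, until requests gives up after 30 redirects.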
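The loop can be reproduced without touching the network. What follows is only a toy sketch, not Google's or requests' actual logic: `fake_server` and `follow` are hypothetical names made up for illustration, and the redirect rule (redirect unless Host is exactly www.google.com) is an assumption based on the behaviour described in this answer.

```python
MAX_REDIRECTS = 30  # the limit reported in the traceback above


def fake_server(host_header):
    """Hypothetical stand-in for the server: redirect unless Host is the www name."""
    if host_header != 'www.google.com':
        return 301, 'https://www.google.com/search?q=movie'
    return 200, None


def follow(host_header, max_redirects=MAX_REDIRECTS):
    """Follow redirects while resending the same Host header on every hop."""
    redirects = 0
    while True:
        status, location = fake_server(host_header)
        if status == 200:
            return redirects  # number of redirects followed before success
        if redirects >= max_redirects:
            raise RuntimeError('Exceeded %s redirects.' % max_redirects)
        # The Host header is deliberately NOT updated to match `location`,
        # which is what keeps the loop going.
        redirects += 1
```

In this sketch, follow('www.google.com') succeeds immediately, while follow('google.com') raises once the redirect limit is reached, because the header the server objects to never changes between hops.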
You can also simply remove the Host header; as far as I can tell, it makes no difference.