Question

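I'm trying to download a bunch of URLs concurrently using the requests module together with Python's built-in multiprocessing library. When I combine the two, I get errors that definitely don't look right. I sent 100 requests with 100 threads; typically about 50 of them finish successfully, while the other 50 get this message: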

   HTTPConnectionPool(host='www.reuters.com', port=80): Max retries exceeded with url: 
/video/2013/10/07/breakingviews-batistas-costly-bluster?videoId=274054858&feedType=VideoRSS&feedName=Business&videoChannel=5&utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+reuters%2FUSVideoBusiness+%28Video+%2F+US+%2F+Business%29 (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)

Neither the "Max retries exceeded" part nor the "nodename nor servname provided" part looks right.

Here is my requests setup:

import requests

req_kwargs = {
  'headers' : {'User-Agent': 'np/0.0.1'},
  'timeout' : 7,
  'allow_redirects' : True
}

# I left out the multiprocessing code but that part isn't important
resp = requests.get(some_url, **req_kwargs)

Does anyone know how to prevent this, or at least how to debug it further?

Thanks.

Answer 1

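I think this may be caused by a request frequency higher than the site allows.

Try the following:

Crawl the site at a lower request frequency. If you get the same error again, open the site in your web browser to check whether the site has blocked your spider.
Crawl the site through a proxy pool, so that the site does not see a high request frequency from one address and block your spider.
Enrich your HTTP request headers so that the requests look like they come from a web browser.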
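
The first suggestion (fewer requests in flight, and trying again after a failure) could be sketched roughly as below. This is only a minimal sketch, not the asker's code: `fetch_with_retry`, `fetch_all`, and the parameters `max_workers`, `attempts`, and `backoff` are hypothetical names chosen for illustration, and `fetch` is assumed to be any callable that takes a URL and returns a response.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(url, fetch, attempts=3, backoff=0.5):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Wait backoff, 2*backoff, 4*backoff, ... before the next try.
            time.sleep(backoff * 2 ** attempt)


def fetch_all(urls, fetch, max_workers=5, **retry_kwargs):
    """Fetch urls with at most max_workers requests in flight.

    Results come back in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda u: fetch_with_retry(u, fetch, **retry_kwargs), urls)
        )
```

With the setup from the question, `fetch` could be something like `lambda u: requests.get(u, **req_kwargs)`, and browser-like headers would go into `req_kwargs['headers']`. A small `max_workers` value is what actually lowers the request frequency; the retry only helps with failures that are temporary.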

Answer 2

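[Errno 8] nodename nor servname provided, or not known

This simply means that www.reuters.com could not be resolved. Either put the IP mapping for it in the hosts file, or in the domain (DNS) configuration.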
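
To check whether name resolution really is the problem, independently of requests, you could try the lookup directly. A minimal sketch using the standard library's `socket.getaddrinfo`; the helper name `resolve` is made up for this example:

```python
import socket


def resolve(host, port=80):
    """Return the sorted IP addresses host resolves to, or [] if the lookup fails."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Same error class as in the traceback: the name could not be resolved.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve('www.reuters.com')` from the same machine (and, ideally, from many threads at once, as in the question) should show whether lookups fail all the time or only under load. An empty list means the resolver could not find the host.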

Python requests with multithreading: "Max retries exceeded with url" caused by <class 'socket.gaierror'>
