When I call the following function to process a long list of URLs (all hitting the same site, i.e. http://foo.bar.com/url1, http://foo.bar.com/url2, etc.):
import time
import grequests

def processUrls(block=2500, write=100000, timeout=0.5):
    urls = ... ## generate long array of URLs
    chunks = [urls[i:i+block] for i in xrange(0, len(urls), block)] ## chunk 'em

    def callback(response, *args, **kwargs):
        txt = response.text
        ## do something with txt
        response.close()

    for i, chunk in enumerate(chunks):
        rs = [grequests.get(url, callback=callback) for url in chunk]
        grequests.map(rs, stream=False, size=block / 10)
        time.sleep(timeout)

    ## do stuff
I get a bunch of errors like this:
  File "/.../python2.7/site-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/.../python2.7/site-packages/grequests.py", line 71, in send
    self.url, **merged_kwargs)
  File "/.../python2.7/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/.../python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/.../python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', error(97, 'Address family not supported by protocol'))
<Greenlet at 0x7f8ce2c0ec30: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x7f8ce31e2890>>(stream=False)> failed with ConnectionError
The number of error messages is far smaller than the number of URLs.
What could be causing these errors? I am running this on RedHat 6.6.
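Update: I collected all the URLs that gave me errors from the full data set I had been using. They all look fine (well-formed, etc.), and when I paste one of them into a browser I get a meaningful result with no error message. I then re-ran the test on only a subset of the data. Again I got some errors and collected the list of bad URLs for the subset. It turns out that none of the bad URLs from the subset appear in the list of bad URLs from the full set. This suggests that the errors are not URL-specific but are some kind of hiccup, either on my side or on the server's side. Does this ring a bell?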
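For reference, the comparison described in the update amounts to roughly the following. This is only a minimal sketch: the helper name `overlapping_failures` and the sample URLs are hypothetical placeholders, and it assumes the failed URLs from each run were saved into plain lists.

```python
def overlapping_failures(failed_full_run, failed_subset_run):
    """Return the URLs that failed in both runs, sorted for stable output."""
    return sorted(set(failed_full_run) & set(failed_subset_run))


# Hypothetical sample data standing in for the URLs pulled from the error output.
failed_full_run = ["http://foo.bar.com/url17", "http://foo.bar.com/url902"]
failed_subset_run = ["http://foo.bar.com/url45"]

common = overlapping_failures(failed_full_run, failed_subset_run)
if common:
    print("URLs failing in both runs: %s" % common)
else:
    print("No overlap: the failures do not look URL-specific")
```

An empty intersection is what I observed, which is why the failures look transient rather than tied to particular URLs.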