Question

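I'm very new to all of this. I need to collect data on several thousand SourceForge projects for a paper I'm writing. The data is all freely available in JSON format at http://sourceforge.net/api/project/name/[project name]/json. I have a list of several thousand of these URLs, and I'm using the following code.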

import grequests
rs = (grequests.get(u) for u in ulist)
answers = grequests.map(rs)

With this code I can get the data for any 200 or so projects I like, i.e. rs = (grequests.get(u) for u in ulist[0:199]) works, but as soon as I go over that, every attempt is met with

ConnectionError: HTTPConnectionPool(host='sourceforge.net', port=80): Max retries exceeded with url: /api/project/name/p2p-fs/json (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)
<Greenlet at 0x109b790f0: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x10999ef50>>(stream=False)> failed with ConnectionError

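After that I can't make any more requests until I quit Python, but as soon as I restart Python I can make another 200 requests.

I've tried using grequests.map(rs, size=200), but that seems to do nothing.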

Answer 1

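I'm answering here in case it helps someone else.

In my case the cause was not rate limiting by the destination server but something much simpler: I didn't explicitly close the responses, so they kept their sockets open and the Python process ran out of file handles.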
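To check whether file handles are a plausible culprit, one can inspect the per-process limit on open descriptors. This is a minimal sketch using the standard-library resource module (Unix only); the helper name open_file_limits is made up for illustration. A soft limit in the low hundreds would be consistent with failures starting after roughly 200 requests, though that is an inference and not something confirmed for the asker's machine.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
```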

My solution (I don't know for sure which of these fixed the issue; in theory either of them should) was to:

Set stream=False in grequests.get:

rs = (grequests.get(u, stream=False) for u in urls)

Call response.close() explicitly after reading response.content:

responses = grequests.map(rs)
for response in responses:
    make_use_of(response.content)
    response.close()

Note: simply destroying the response object (assigning None to it, calling gc.collect()) was not enough; that did not close the file handles.
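The close-after-read step above can be made a little more defensive. The following is a sketch, not the answerer's exact code: consume_and_close and handle are hypothetical names, and it relies on the fact that grequests.map() puts None in the result list for requests that failed. The try/finally ensures the socket is released even if processing a body raises.

```python
def consume_and_close(responses, handle):
    """Pass each response body to `handle`, then close the response.

    `responses` is a list such as the one returned by grequests.map();
    entries for failed requests are None and are skipped.
    Returns the number of responses that were handled successfully.
    """
    handled = 0
    for response in responses:
        if response is None:
            continue  # the request failed, nothing to read or close
        try:
            handle(response.content)
            handled += 1
        finally:
            response.close()  # release the socket even if handle() raises
    return handled
```

With the names from the answer, this would be called as consume_and_close(grequests.map(rs), make_use_of).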

Answer 2

This can easily be changed to use whatever number of connections you want.

import time
import grequests

MAX_CONNECTIONS = 100  # Number of connections you want to limit it to
# urlsList: Your list of URLs.

results = []
# Step through the list in slices of MAX_CONNECTIONS, starting at index 0.
for x in range(0, len(urlsList), MAX_CONNECTIONS):
    rs = (grequests.get(u, stream=False) for u in urlsList[x:x+MAX_CONNECTIONS])
    time.sleep(0.2)  # You can change this to whatever you see works better.
    results.extend(grequests.map(rs))  # The key here is to extend, not append, not insert.
    print("Waiting")  # Optional, so you see something is done.
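The same batching idea can be written as a reusable function. This is only a sketch: fetch_in_batches, fetch_batch, batch_size and pause are hypothetical names, and the actual network call is passed in as a callable so that the slicing logic can be tried without grequests or a network connection.

```python
import time


def fetch_in_batches(urls, fetch_batch, batch_size=100, pause=0.2):
    """Fetch `urls` in consecutive slices of at most `batch_size` items.

    `fetch_batch` takes a list of URLs and returns a list of results.
    With grequests it could be, for example:
        lambda batch: grequests.map(grequests.get(u, stream=False) for u in batch)
    """
    results = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        results.extend(fetch_batch(batch))
        if start + batch_size < len(urls):
            time.sleep(pause)  # short pause between batches, none after the last
    return results
```

Because each batch is fully mapped before the next one starts, at most batch_size requests are in flight at a time. Responses still need to be closed as described in Answer 1.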

Using grequests to make several thousand requests to SourceForge, getting "Max retries exceeded with url"
