I'm trying to download a large number of files (10k+) from the web with the python-requests package; each file ranges from a few KB up to 100 MB. The script runs fine for maybe 3000 files, and then it suddenly hangs. When I Ctrl-C it, I see it is stuck at:
r = requests.get(url, headers=headers, stream=True)
File "/Library/Python/2.7/site-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 559, in send
r = adapter.send(request, **kwargs)
File "/Library/Python/2.7/site-packages/requests/adapters.py", line 327, in send
timeout=timeout
File "/Library/Python/2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 493, in urlopen
body=body, headers=headers)
File "/Library/Python/2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 319, in _make_request
httplib_response = conn.getresponse(buffering=True)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
Here is my Python code that does the download:
basedir = os.path.dirname(filepath)
if not os.path.exists(basedir):
    os.makedirs(basedir)
r = requests.get(url, headers=headers, stream=True)
with open(filepath, 'wb') as f:  # binary mode, so file contents are not corrupted
    for chunk in r.iter_content(1024):
        if chunk:
            f.write(chunk)
            f.flush()
I'm not sure what went wrong. If anyone has a clue, please share some insight. Thanks.
Answer 0 (score: 0)
This is not a duplicate of the question @alfasin linked in the comments. Judging from the (limited) traceback you posted, the request itself is hanging: the first line shows it executing r = requests.get(url, headers=headers, stream=True).
What you should do is set a timeout and catch the exception raised when a request times out. Once you have the offending URL, try it in a browser or with curl to make sure it responds correctly; otherwise, remove it from the list of URLs you request. If you do find a misbehaving URL, please update your question with it.
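A minimal sketch of that approach, assuming a 10-second timeout (the timeout value, the helper name fetch, and the choice to also catch connection errors are illustrative assumptions, not part of the original answer):

```python
import requests


def fetch(url, headers=None, timeout=10):
    """Try to fetch a URL; return the response, or None if it misbehaves.

    The 10-second default timeout is an illustrative choice. The timeout
    applies to establishing the connection and to each read, not to the
    whole transfer.
    """
    try:
        return requests.get(url, headers=headers, stream=True, timeout=timeout)
    except requests.exceptions.Timeout:
        # This is the hang described in the question: log it and move on.
        print('Timed out, skipping: %s' % url)
        return None
    except requests.exceptions.RequestException as exc:
        # Broader catch for connection resets, DNS failures, etc.
        print('Request failed for %s: %s' % (url, exc))
        return None
```

Any URL that repeatedly returns None here is a candidate for the curl/browser check suggested above.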
Answer 1 (score: 0)
I ran into a similar situation, and it looks like a bug in the requests package was causing it. Upgrading to requests 2.10.0 fixed it for me.
For reference, the changelog for requests 2.10.0 shows that the embedded urllib3 was updated to version 1.15.1 (Release history).
The urllib3 release history (Release history) shows that version 1.15.1 includes the following fix: