I'm having a problem where code that downloads files from URLs using requests stops for no apparent reason. When I start the script it downloads a few hundred files, but then it just stalls somewhere. If I try the URL manually in a browser, the image loads without a problem. I also tried urllib.retrieve, but ran into the same issue. I'm using Python 2.7.5 on OSX.

Following up, I attached dtruss to the process while it was stalled; nothing happened for 10 minutes, so I hit ctrl-c. The code:
def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = requests.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)
def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
The dtruss output:
My-desk:~ Me$ sudo dtruss -p 708
SYSCALL(args) = return
Traceback:
318 http://farm1.static.flickr.com/32/47394454_10e6d7fd6d.jpg
Traceback (most recent call last):
  File "slow_download.py", line 71, in <module>
    if final_path == '':
  File "slow_download.py", line 34, in download_photos_from_urls
    download_path = concept+'/'+url.split('/')[-1]
  File "slow_download.py", line 21, in download_from_url
    with open(download_path, 'wb') as handle:
  File "/Library/Python/2.7/site-packages/requests/models.py", line 638, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 256, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 186, in read
    data = self._fp.read(amt)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
KeyboardInterrupt
Answer 0 (score: 1)
Perhaps the lack of connection pooling is causing too many connections. Try something like this (using a session):
import requests

session = requests.Session()

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = session.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
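If simply reusing a Session isn't enough, you can also size the connection pool and add retries explicitly. This is a rough sketch, not part of the original answer, and the adapter numbers are only illustrative:

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Illustrative settings: cache pools for up to 10 hosts, keep up to 10
# connections per host, and retry transient connection errors 3 times.
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=3)
session.mount('http://', adapter)
session.mount('https://', adapter)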
Answer 1 (score: 1)
So, just to unify all the comments and propose a potential solution: there are a couple of reasons your downloads could be failing after a few hundred. It could be something internal to Python, such as hitting the maximum number of open file handles, or it could be the server blocking you for being a bot.
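As a quick sanity check on the first theory, you can ask the OS what the per-process open-file limit actually is; a minimal sketch (not from the original answer), using the standard resource module:

import resource

# Soft/hard limits on open file descriptors for this process. If the script
# stalls at roughly the soft limit, leaked handles are a likely culprit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print 'open file limit: soft=%s, hard=%s' % (soft, hard)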
You haven't shared all of your code, so it's a little hard to say for sure, but at least in what you've shown you open the file you're writing to inside a with context manager, so you shouldn't run into problems there. The request objects, however, may not be getting closed properly after you exit the loop; I'll show you how to deal with that below.
The default requests user agent is (on my machine):
python-requests/2.4.1 CPython/3.4.1 Windows/8
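If you want to see what your own install sends, requests exposes its default UA; a quick check (not part of the original answer):

import requests

# The string requests puts in the User-Agent header when you don't override it.
print requests.utils.default_user_agent()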
So it's not inconceivable that the server you're requesting from screens for UAs like that and limits the number of connections they get. The reason your code also sort of worked with urllib.retrieve is that its UA is different from requests', so the server let it continue for roughly the same number of requests before shutting it down, too.
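For comparison, urllib announces itself differently, which is consistent with the theory above; a small check (again, not from the original answer):

import urllib

# urllib's default User-Agent, e.g. Python-urllib/1.17 on Python 2.7,
# so the server sees it as a different client than python-requests/...
print urllib.URLopener().version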
To get around these issues, I suggest changing your download_from_url() function to something like this:
import requests
from time import sleep

def download_from_url(url, download_path, delay=5):
    headers = {'Accept-Encoding': 'identity, deflate, compress, gzip',
               'Accept': '*/*',
               'Connection': 'keep-alive',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0'}
    with open(download_path, 'wb') as handle:
        response = requests.get(url, headers=headers)  # no stream=True, that could be an issue
        handle.write(response.content)
        response.close()
    sleep(delay)
Instead of stream=True, we use the default of False to download the full content of the request immediately. The headers dict contains a few default values, plus the all-important 'User-Agent' value, which in this example happens to be my UA, determined with WhatsMyUserAgent. Feel free to change it to whatever your preferred browser returns. Here I just write the whole content to disk, eliminating extraneous code and some potential sources of error: for example, if your network connection hiccups, you could temporarily get empty blocks and break out of the loop by mistake. I also explicitly close the request, just in case. Finally, I added an extra parameter, delay, to make the function sleep for a certain number of seconds before returning. I gave it a default value of 5; you can make it whatever you want (it also accepts floats for fractional seconds).
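For example, an individual call can override the delay; fractional seconds are fine (1.5 below is just an illustrative value):

# 1.5 seconds between downloads instead of the default 5
download_from_url(url, download_path, delay=1.5)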
I don't happen to have a large batch of image URLs to test this against, but it should work as expected. Good luck!