Downloading several hundred files with `requests` stalls partway through.

Date: 2014-09-30 16:06:15

Tags: python, python-requests

I'm having a problem where my code, which downloads files from URLs using requests, stops working for no apparent reason. When I start the script it downloads several hundred files, but then it just halts somewhere. If I try a URL manually in the browser, the image loads without any problem. I also tried urllib.retrieve and ran into the same issue. I'm using Python 2.7.5 on OS X.

Below you'll find:

  • the code I use,
  • the stack trace (dtruss) while the program is stalled, and
  • the traceback that is printed when I ctrl-c the process after nothing has happened for 10 minutes.

Code:

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = requests.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results

Stack trace:

My-desk:~ Me$ sudo dtruss -p 708
SYSCALL(args) 		 = return

Traceback:

318 http://farm1.static.flickr.com/32/47394454_10e6d7fd6d.jpg
Traceback (most recent call last):
  File "slow_download.py", line 71, in <module>
    if final_path == '':
  File "slow_download.py", line 34, in download_photos_from_urls
    download_path = concept+'/'+url.split('/')[-1]
  File "slow_download.py", line 21, in download_from_url
    with open(download_path, 'wb') as handle:
  File "/Library/Python/2.7/site-packages/requests/models.py", line 638, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 256, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 186, in read
    data = self._fp.read(amt)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
KeyboardInterrupt

2 answers:

Answer 0 (score: 1):

Perhaps the lack of pooling is causing too many connections. Try something like this (using a session):

import requests

session = requests.Session()

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = session.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
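
If pooling is indeed the bottleneck, the session's transport adapter can also be tuned explicitly. A minimal sketch, assuming requests' HTTPAdapter; the pool sizes and retry count here are arbitrary illustrative values:

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Reuse a bounded pool of connections and retry transient connection failures.
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=3)
session.mount('http://', adapter)
session.mount('https://', adapter)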

Answer 1 (score: 1):

So, just to unify all the comments and propose a potential solution: there are a couple of reasons why your downloads might fail after a few hundred. It could be something internal to Python, such as hitting the maximum number of open file handles, or it could be the server blocking you for looking like a bot.
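
If you want to rule out the file-handle theory first, you can check the per-process limit directly. A quick check, assuming the standard-library resource module (available on OS X and Linux):

import resource

# Prints the (soft, hard) limit on open file descriptors for this process.
print resource.getrlimit(resource.RLIMIT_NOFILE)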

You haven't shared all of your code, so it's a little hard to say, but at least judging from what you have shown, you are using the with context manager when opening the files for writing, so you shouldn't run into problems there. It's possible that the request objects aren't being closed properly after exiting the loop, but I'll show you how to deal with that below.
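
If you'd rather keep stream=True, one defensive option is to make sure every response gets closed even when you break out of the loop early. A small sketch using contextlib.closing from the standard library, which works with anything that has a close() method:

from contextlib import closing
import requests

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        # closing() guarantees response.close() runs even if we break out early.
        with closing(requests.get(url, stream=True)) as response:
            for block in response.iter_content(1024):
                if not block:
                    break
                handle.write(block)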

The default requests User-Agent is (on my machine):

python-requests/2.4.1 CPython/3.4.1 Windows/8
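
If you want to see what your own installation sends, you can inspect a session's default headers; the exact string will vary with your requests version and platform:

import requests

session = requests.Session()
# The session's default headers include the User-Agent string requests will send.
print session.headers['User-Agent']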

So it isn't inconceivable that the server you're requesting from is screening for various UAs like this one and limiting their number of connections. The reason your code also happened to work with urllib.retrieve is that its UA is different from requests', so the server allowed it to continue for roughly the same number of requests and then shut it down, too.

To get around these issues, I suggest changing your download_from_url() function to the following:

import requests
from time import sleep

def download_from_url(url, download_path, delay=5):
    headers = {'Accept-Encoding': 'identity, deflate, compress, gzip', 
               'Accept': '*/*',
               'Connection': 'keep-alive',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0'}
    with open(download_path, 'wb') as handle:
        response = requests.get(url, headers=headers) # no stream=True, that could be an issue
        handle.write(response.content)
        response.close()
        sleep(delay)

Instead of stream=True, we use the default value of False to download the full content of the request immediately. The headers dict contains a few default values, plus the all-important 'User-Agent' value, which in this example happens to be my UA, determined with What'sMyUserAgent; you can change it to whatever your preferred browser reports. Instead of iterating through the content in 1 KB blocks, here I just write the entire content to disk at once, eliminating extraneous code and some potential sources of error - for example, if your network connection hiccups, you could temporarily get empty blocks and break out of the loop by mistake. I also close the request explicitly, just in case. Finally, I added an extra parameter to the function, delay, to make it sleep for a number of seconds before returning. I gave it a default value of 5; you can make it whatever you like (it also accepts floats, for fractional seconds).
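
For completeness, a hypothetical call site for the revised function; the URL is the one from your traceback, and the loop itself is only illustrative:

# Example URL taken from the traceback above.
urls = ['http://farm1.static.flickr.com/32/47394454_10e6d7fd6d.jpg']
for url in urls:
    download_path = url.split('/')[-1]   # save into the current directory
    download_from_url(url, download_path, delay=2.5)  # fractional delays work too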

I don't happen to have a huge list of image URLs to test this with, but it should work as expected. Good luck!