Question

test.txt contains a list of files to download:

http://example.com/example/afaf1.tif
http://example.com/example/afaf2.tif
http://example.com/example/afaf3.tif
http://example.com/example/afaf4.tif
http://example.com/example/afaf5.tif

How can I download these files with Python at the maximum download speed?

My idea so far is as follows:

import urllib.request
with open ('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)

What comes next? And how do I choose the download directory?

Answer 1

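Choose the path of the output directory you want (output_dir). In your for loop, split each URL on the / character and use the last piece as the file name. Also open the file for writing in binary mode (wb), because response.read() returns bytes, not str.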

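import os
import urllib.request

# The output directory must already exist
output_dir = 'path/to/your/output/dir'

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()

for line in lines:
    # Use the last piece of the URL as the local file name
    output_file = os.path.join(output_dir, line.split('/')[-1])
    # Close the HTTP response as soon as the file has been written
    with urllib.request.urlopen(line) as response:
        with open(output_file, 'wb') as writer:
            writer.write(response.read())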
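
Splitting on / works for the URLs in the question, but it would keep any query string or fragment in the file name. Here is a minimal sketch of a slightly more defensive variant. The helper name filename_from_url and the fallback name download.bin are my own assumptions, not part of the answer above:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, ignoring query string and fragment.

    `default` is a hypothetical fallback used when the URL path ends in '/'.
    """
    # urlsplit separates the path from '?query' and '#fragment'
    path = unquote(urlsplit(url).path)
    name = posixpath.basename(path)
    return name or default
```

You would then build the target path with os.path.join(output_dir, filename_from_url(line)).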

Notes:

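Downloading multiple files will be faster if you use multiple threads, because a single download rarely uses the full bandwidth of your internet connection.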
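
One way to do the multi-threaded part is a thread pool from the standard library. This is only a sketch: the function names download_one and download_all and the default of 4 workers are assumptions of mine, so tune the worker count for your connection and the server.

```python
import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def download_one(url, output_dir):
    """Download a single URL into output_dir and return the local path."""
    output_file = os.path.join(output_dir, url.split('/')[-1])
    with urllib.request.urlopen(url) as response, open(output_file, 'wb') as writer:
        shutil.copyfileobj(response, writer)
    return output_file


def download_all(urls, output_dir, max_workers=4):
    """Download all URLs concurrently; paths are returned in input order."""
    os.makedirs(output_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces all downloads to finish and re-raises the first error
        return list(pool.map(lambda url: download_one(url, output_dir), urls))


# Example usage with the list from the question:
# with open('test.txt') as f:
#     download_all(f.read().splitlines(), 'path/to/your/output/dir')
```

Threads fit here because the work is I/O-bound: each thread spends most of its time waiting on the network.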

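Also, if the files you are downloading are very large, you should stream them (read them chunk by chunk). As @Tiran noted, you should use shutil.copyfileobj(response, writer) instead of writer.write(response.read()).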

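I would only add that you should always specify the length parameter: shutil.copyfileobj(response, writer, 5*1024*1024) # (at least 5MB), because the default value of 16 KB (64 KB on most platforms since Python 3.8) is very small and will only slow things down.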
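
Putting the streaming advice together, a sketch could look like the following. The names save_stream, download_streamed and CHUNK_SIZE are hypothetical, and 5 MB is simply the buffer size suggested above:

```python
import shutil
import urllib.request

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB buffer, as suggested above


def save_stream(response, output_file, chunk_size=CHUNK_SIZE):
    """Copy a file-like response to output_file one chunk at a time."""
    with open(output_file, 'wb') as writer:
        # Only chunk_size bytes are held in memory at any moment
        shutil.copyfileobj(response, writer, chunk_size)


def download_streamed(url, output_file):
    """Open url and stream its body to output_file."""
    with urllib.request.urlopen(url) as response:
        save_stream(response, output_file)
```

Unlike writer.write(response.read()), this never loads the whole file into memory.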

Answer 2

This works well for me (note that each name must be the full file name, for example 'afaf1.tif'):

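import os
import urllib.request

def download(baseUrl, fileName, layer=0):
    print('Trying to download file:', fileName)
    url = baseUrl + fileName
    # Note that the folder needs to exist
    name = os.path.join('foldertodownload', fileName)
    try:
        urllib.request.urlretrieve(url, name)
    except Exception:
        # On failure, retry up to 5 times
        print('Download failed')
        print('Could not download file:', fileName)
        if layer > 4:
            return
        layer += 1
        print('retrying', str(layer) + '/5')
        download(baseUrl, fileName, layer)
        # Return here so a failed attempt does not report success below
        return
    print(fileName + ' downloaded')

# Base URL and file names taken from the question
url = 'http://example.com/example/'
nameList = ['afaf1.tif', 'afaf2.tif', 'afaf3.tif', 'afaf4.tif', 'afaf5.tif']

for fileName in nameList:
    download(url, fileName)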
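
The same retry idea can also be written as a loop instead of recursion. This is a sketch under my own assumptions: the name download_with_retries, the delay between attempts, and the injectable fetch parameter are not part of the code above.

```python
import time
import urllib.request


def download_with_retries(url, name, attempts=5, delay=1.0,
                          fetch=urllib.request.urlretrieve):
    """Call fetch(url, name) up to `attempts` times; return True on success."""
    for attempt in range(1, attempts + 1):
        try:
            fetch(url, name)
            return True
        except OSError as exc:
            # urllib.error.URLError and HTTPError are subclasses of OSError
            print(f'Attempt {attempt}/{attempts} failed for {url}: {exc}')
            if attempt < attempts:
                time.sleep(delay)  # brief pause before the next attempt
    return False
```

Returning a boolean lets the caller collect the names that failed rather than only printing them.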


Downloading a large number of files using Python
