My team needs to pull 30+ files a day, averaging roughly 5 to 10 gigabytes each. Timing a single urllib2 request per file, each one takes about 1.5 to 2 hours sequentially, which works out to only about 12 files downloaded per day. These 30+ files are generated daily and have to be pulled on top of all of our data-analysis team's other downloads and automated processes. Being able to download several files at once with minimal bandwidth loss would be ideal.
I found this approach in some leftover code on our system, but I'm wondering whether it is actually better or just looks better. From testing, it seems to work for 3 to 10 files, but beyond that the additional instances slow each other down. One more problem: I only want to open 5 to 10 instances at a time, because that is where I noticed the bandwidth dropping, and 5 seems to be the sweet spot. So how do I make script1.py wait and check that all files have finished downloading before iteratively opening another 5 instances of script2.py? Would urllib3 be better? I'm not very familiar with the threading or multiprocessing libraries.
#script1.py
import subprocess, time
lines = 0
homepath = "C:\\Auto_tasks\\downloader\\logs"
url_list_local = "c:\\Requests\\download_urls.txt"
targets_file = open(url_list_local, 'r')
for line in targets_file:
    url = line.rstrip('\n')
    surl = ("\"C:\\Python26\\python.exe\" "
            "\"C:\\Auto_tasks\\downloader\\scripts\\script2.py\" "
            + url + " \"" + homepath + "\"")
    subprocess.Popen(surl)
    lines += 1
    time.sleep(1)
#script2.py, individual instances opened simultaneously for n files
import urllib2, time, os, sys, shutil, subprocess
os.chdir("C:\\Auto_tasks\\downloader\\working") #sets directory where downloads will go
homepath = sys.argv[2]
url = sys.argv[1]
file_name = url.split('/')[-1]
# command to relaunch this script if the download fails to start
surl = ("\"C:\\Python26\\python.exe\" "
        "\"C:\\Auto_tasks\\downloader\\scripts\\script2.py\" "
        + url + " \"" + homepath + "\"")
try:
    u = urllib2.urlopen(url)
except IOError:
    print "FAILED to start download, retrying..."
    time.sleep(30)
    subprocess.Popen(surl)
    sys.exit(1)  # don't fall through and try to move a file that was never written
# write the response to disk
with open(file_name, 'wb') as f:
    shutil.copyfileobj(u, f)
src_file = "C:\\Auto_tasks\\downloader\\working\\" + file_name
dst_file = "C:\\Auto_tasks\\downloader\\completed"
shutil.move(src_file, dst_file)
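For the batching question (launch 5 instances, wait for all of them to finish, then launch the next 5), one minimal sketch is to collect the Popen handles for a group and wait() on each before starting the next group. The helper names batches and run_in_batches below are hypothetical, not part of the scripts above:

```python
import subprocess

def batches(items, size):
    """Yield successive groups of `size` items from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_in_batches(commands, size=5):
    """Run commands `size` at a time; wait for the whole batch to
    exit before starting the next one."""
    for group in batches(commands, size):
        procs = [subprocess.Popen(cmd) for cmd in group]
        for p in procs:
            p.wait()  # blocks until this process finishes
```

script1.py could build one command per URL and hand the whole list to run_in_batches(commands, 5) instead of launching every instance up front.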
Answer 0 (score: 0)
Downloading multiple files is a very common task. In the Linux world, wget can manage bandwidth and much more for you, and that tool may be available for Windows as well.
To implement it with a Python process pool, here is one way:
# downpool.py
import logging
import multiprocessing
import os, shutil, sys, urllib

def downloader(url):
    mylog = multiprocessing.get_logger()
    mylog.info('start')
    mylog.info('%s: downloading', url)
    # download to a temporary location
    (temp_path, _headers) = urllib.urlretrieve(url)
    # move to final directory, preserving the file name from the URL
    dest_path = os.path.join('temp', os.path.basename(url))
    shutil.move(temp_path, dest_path)
    mylog.info('%s: done', url)
    return dest_path

if __name__ == '__main__':
    # the __main__ guard is required for multiprocessing on Windows
    plog = multiprocessing.log_to_stderr()
    plog.setLevel(logging.INFO)

    download_urls = [line.strip() for line in open(sys.argv[1])]
    plog.info('starting parallel downloads of %d urls', len(download_urls))
    pool = multiprocessing.Pool(5)
    plog.info('running jobs')
    download_paths = list(pool.imap(downloader, download_urls))
    plog.info('done')
    print 'Downloaded:\n', '\n'.join(download_paths)
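The retry-on-IOError behavior from script2.py can be folded into the pool version with a small wrapper instead of re-spawning a script. with_retries below is a hypothetical helper, not part of the answer's code:

```python
import time

def with_retries(func, attempts=3, delay=30):
    """Wrap a one-argument function so an IOError triggers up to
    `attempts` tries, sleeping `delay` seconds between them."""
    def wrapper(arg):
        for attempt in range(attempts):
            try:
                return func(arg)
            except IOError:
                if attempt == attempts - 1:
                    raise  # give up after the last attempt
                time.sleep(delay)
    return wrapper
```

Because Windows worker processes must pickle the task function they receive, the simplest place to apply this is inside downloader itself (for example, calling with_retries(urllib.urlretrieve)(url)) rather than wrapping downloader at the pool level, since a closure built by with_retries is not picklable.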