How can I run two groups of tasks concurrently in Python when one group depends on the other?

Time: 2019-04-16 21:49:15

Tags: python multithreading multiprocessing

I have a large number of small files to download from S3 and process.

The downloads are quite fast, since each individual file is only a few megabytes; together they total about 100 GB. Processing takes roughly twice as long as downloading and is entirely CPU-bound. So it should be possible to shorten the overall runtime by processing files in multiple threads while other files are still downloading.

Currently I download a file, process it, and then move on to the next one. Is there a way in Python to download all the files one after another and process each file as soon as its download completes? The key difference is that while each file is being processed, another file is always downloading.

My code looks like this:

import subprocess

files = {'txt': ['filepath1', 'filepath2', ...], 
         'tsv': ['filepath1', 'filepath2', ...]
        } 

for kind in files.keys():
    subprocess.check_call(f'mkdir -p {kind}', shell=True)
    subprocess.call(f'mkdir -p {kind}/normalized', shell=True)

    for i, file in enumerate(files[kind]):
        subprocess.call(f'aws s3 cp s3://mys3bucket.com/{file} {kind}/', shell=True)
        f = file.split('/')[-1]
        subprocess.check_call(f'my_process_function --input "{kind}/{f}" --output "{kind}/normalized/normalize_{f}" --units relab', shell=True)

I also wrote a multiprocessing solution that downloads and processes multiple files at the same time, but it doesn't give any speedup because the network is already saturated; the bottleneck is the processing. I've included it below in case it helps.

import subprocess
from contextlib import closing
from os import cpu_count
from multiprocessing import Pool

def download_and_proc(file, kind='txt'):
    subprocess.call(f'aws s3 cp s3://mys3bucket.com/{file} {kind}/', shell=True)
    f = file.split('/')[-1]
    subprocess.check_call(f'my_process_function --input "{kind}/{f}" --output "{kind}/normalized/normalize_{f}" --units relab', shell=True)

# note: `files` must be a flat iterable of file paths for this map
with closing(Pool(processes=cpu_count()*2)) as pool:
    pool.map(download_and_proc, files)

1 Answer:

Answer 0 (score: 1)

Your current multiprocessing code should be pretty close to optimal over the long term. It won't always be downloading at maximum speed, since the same threads of execution that are responsible for downloading a file will wait until the file has been processed before downloading another one. But it should usually have all the CPU consumed in processing, even if some network capacity is going unused. If you tried to always be downloading too, you'd eventually run out of files to download and the network would go idle for the same amount of time, just all at the end of the batch job.

One possible exception is if the time taken to process a file is always exactly the same. Then you might find your workers running in lockstep, where they all download at the same time, then all process at the same time, even though there are more workers than there are CPUs for them to run on. Unless the processing is somehow tied to a real time clock, that doesn't seem likely to occur for very long. Most of the time you'd have some processes finishing before others, and so the downloads would end up getting staggered.

So improving your code is not likely to give you much in the way of a speedup. If you think you need it though, you could split the downloading and processing into two separate pools. It might even work to do one of them as a single-process loop in the main process, but I'll show the full two-pool version here:

def download_worker(file, kind='txt'):
    subprocess.call(f'aws s3 cp s3://mys3bucket.com/{file} {kind}/', shell=True)
    return file  # hand the downloaded file on to the processing pool

def processing_worker(file, kind='txt'):
    f = file.split('/')[-1]
    subprocess.check_call(f'my_process_function --input "{kind}/{f}" --output "{kind}/normalized/normalize_{f}" --units relab', shell=True)

with Pool() as download_pool, Pool() as processing_pool:
    downloaded_iterator = download_pool.imap(download_worker, files)  # imap returns an iterator
    processing_pool.map(processing_worker, downloaded_iterator)

This should both download and process as fast as your system is capable of. If downloading a file takes less time than processing it, then it's pretty likely that the first pool will be done before the second one, which the code will handle just fine. If the processing is not the bottleneck, it will support that too (the second pool will be idle some of the time, waiting on files to finish downloading).
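For what it's worth, here is a minimal, self-contained sketch of how the two-pool version might be driven with the files dict from the question. The download_task/processing_worker wrapper names, the flattening into (file, kind) pairs, the mkdir step, and the use of imap on the processing side (rather than map, which as far as I can tell first collects the whole download iterator into a list) are my own assumptions rather than anything specified above:

import subprocess
from multiprocessing import Pool

# Placeholder file lists, as in the question.
files = {'txt': ['filepath1', 'filepath2'],
         'tsv': ['filepath1', 'filepath2']}

def download_task(task):
    """Download one file; `task` is a (file, kind) pair."""
    file, kind = task
    subprocess.call(f'aws s3 cp s3://mys3bucket.com/{file} {kind}/', shell=True)
    return task  # hand the same pair on to the processing pool

def processing_task(task):
    """Process one downloaded file; `task` is a (file, kind) pair."""
    file, kind = task
    f = file.split('/')[-1]
    subprocess.check_call(
        f'my_process_function --input "{kind}/{f}" '
        f'--output "{kind}/normalized/normalize_{f}" --units relab',
        shell=True)

if __name__ == '__main__':
    # Create the output directories up front, once per kind.
    for kind in files:
        subprocess.check_call(f'mkdir -p {kind}/normalized', shell=True)

    # Flatten the dict into (file, kind) pairs so each worker knows its kind.
    tasks = [(file, kind) for kind, paths in files.items() for file in paths]

    with Pool() as download_pool, Pool() as processing_pool:
        # The download pool yields each pair as soon as its download finishes.
        downloaded = download_pool.imap(download_task, tasks)
        # Use imap on the processing side as well: plain map() would wait for
        # every download to complete before submitting any processing work.
        for _ in processing_pool.imap(processing_task, downloaded):
            pass

Since each pair is returned by the download worker and handed straight to the processing pool, a file only waits on its own download, not on the whole batch.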