Question

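I have a program that copies large numbers of files from one location to another - I'm talking 100,000+ files (I'm copying 314 GB of image sequences at this moment). Both locations are on huge, very fast network storage RAIDs. I'm using shutil to copy the files over sequentially and it takes some time, so I'm trying to find the best way to optimize this. I've noticed that some software I use effectively multithreads reading files off the network, with huge gains in load times, so I'd like to try doing this in Python.

I have no experience with multithreaded/multiprocess programming - does this seem like the right area to pursue? If so, what's the best way to do it? I've looked at a few other SO posts about threaded file copying in Python and they all seem to say you get no speed gain, but I don't think that will be the case given my hardware. I'm nowhere near my IO cap at the moment, and resource usage is sitting at around 1% (I have 40 cores and 64 GB of RAM locally).

Spencer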

Answer 1

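Update:

I never did get Gevent working (the gevent-based answer) because I couldn't install the module without an internet connection, which I don't have on my workstation. However, I was able to cut file copy times by a factor of 8 using just Python's built-in threading (which I have since learned how to use), and I wanted to post it as an additional answer for anyone interested! My code is below. Note that the 8x speedup will most likely differ from environment to environment, depending on your hardware/network setup.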

import queue, threading, os
import shutil

fileQueue = queue.Queue()
destPath = 'path/to/cop'

class ThreadedCopy:
    totalFiles = 0
    copyCount = 0
    lock = threading.Lock()

    def __init__(self):
        with open("filelist.txt", "r") as txt:  # text file with one file path per line
            fileList = txt.read().splitlines()

        if not os.path.exists(destPath):
            os.mkdir(destPath)

        self.totalFiles = len(fileList)

        print(str(self.totalFiles) + " files to copy.")
        self.threadWorkerCopy(fileList)


    def CopyWorker(self):
        # Each worker keeps pulling file names off the shared queue.
        while True:
            fileName = fileQueue.get()
            shutil.copy(fileName, destPath)
            with self.lock:
                self.copyCount += 1
                percent = (self.copyCount * 100) // self.totalFiles
                print(str(percent) + " percent copied.")
            # Mark the item done only after the progress update, so that
            # join() cannot return before the last message is printed.
            fileQueue.task_done()

    def threadWorkerCopy(self, fileNameList):
        # Start 16 daemon worker threads, then feed them the file names.
        for i in range(16):
            t = threading.Thread(target=self.CopyWorker)
            t.daemon = True
            t.start()
        for fileName in fileNameList:
            fileQueue.put(fileName)
        # Block until every queued file has been copied.
        fileQueue.join()

ThreadedCopy()
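
The script above expects a filelist.txt with one source path per line. Here is a minimal sketch of one way to produce that file, assuming all sources live under a single directory tree. The helper name write_file_list and the path "path/to/source" are hypothetical, not part of the original answer.

```python
import os


def write_file_list(src_root, list_path="filelist.txt"):
    """Write every file path under src_root to list_path, one per line.

    Hypothetical helper for illustration; returns the number of paths written.
    """
    count = 0
    with open(list_path, "w") as out:
        # os.walk visits src_root and every subdirectory below it.
        for dirpath, _dirnames, filenames in os.walk(src_root):
            for name in sorted(filenames):
                out.write(os.path.join(dirpath, name) + "\n")
                count += 1
    return count


if __name__ == "__main__":
    # "path/to/source" is a placeholder; point it at the real source tree.
    print(write_file_list("path/to/source"), "paths written")
```

Because the script copies every file into the single destPath directory, files that share a name in different subdirectories would overwrite one another, so this suits flat collections such as image sequences.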

Answer 2

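This can be parallelized using gevent in Python.

I would recommend the following logic to speed up copying 100k+ files:

1. Put the names of all the 100k+ files that need to be copied in a csv file, e.g. 'input.csv'.
2. Create chunks from that csv file. The number of chunks should be decided based on the number of processors/cores in your machine.
3. Pass each chunk to a separate thread.
4. Each thread sequentially reads the file names in its chunk and copies each file from one location to another.

Here is the Python code snippet: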

import sys
import os
import multiprocessing

from gevent import monkey
monkey.patch_all()

from gevent.pool import Pool

def _copyFile(file):
    # over here, you can put your own logic of copying a file from source to destination
    pass

def _worker(csv_file, chunk):
    # Read only this chunk's byte range and copy every file named in it.
    # Binary mode keeps the seek/tell offsets as plain byte counts.
    with open(csv_file, "rb") as f:
        f.seek(chunk[0])
        for file in f.read(chunk[1]).decode().splitlines():
            _copyFile(file)


def _getChunks(file, size):
    # Yield (start, length) byte ranges of roughly `size` bytes,
    # each extended to the end of the current line.
    f = open(file, "rb")
    while 1:
        start = f.tell()
        f.seek(size, 1)
        s = f.readline()
        yield start, f.tell() - start
        if not s:
            f.close()
            break

if __name__ == "__main__":
    if len(sys.argv) > 1:
        csv_file_name = sys.argv[1]
    else:
        print("Please provide a csv file as an argument.")
        sys.exit()

    no_of_procs = multiprocessing.cpu_count() * 4

    file_size = os.stat(csv_file_name).st_size

    # Integer division: seek() needs a whole number of bytes.
    file_size_per_chunk = file_size // no_of_procs

    pool = Pool(no_of_procs)

    for chunk in _getChunks(csv_file_name, file_size_per_chunk):
        pool.apply_async(_worker, (csv_file_name, chunk))

    pool.join()

Save the file as file_copier.py. Open a terminal and run:

$ python file_copier.py input.csv
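
For readers who cannot install gevent, the chunking steps described in this answer can also be sketched with the standard library alone. This is a sketch under stated assumptions, not the answer's code: the names split_into_chunks, copy_chunk, and copy_in_chunks are hypothetical, the file names are assumed to be already loaded into a list, and plain threading.Thread objects stand in for the gevent pool.

```python
import shutil
import threading


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous lists of near-equal size."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def copy_chunk(paths, dest_dir):
    """Copy the files of one chunk sequentially into dest_dir."""
    for path in paths:
        shutil.copy(path, dest_dir)


def copy_in_chunks(paths, dest_dir, n_threads=8):
    """Give each chunk to its own thread and wait for all of them."""
    threads = [
        threading.Thread(target=copy_chunk, args=(chunk, dest_dir))
        for chunk in split_into_chunks(paths, n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Static chunks are simple, but a thread whose chunk happens to contain the largest files finishes last while the others sit idle; a shared queue or a pool that hands out one file at a time balances the load better.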

Answer 3

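While re-implementing the code posted by @Spencer, I ran into the same error mentioned in the comments below the post (more specifically: OSError: [Errno 24] Too many open files). I solved it by moving away from daemon threads and using concurrent.futures.ThreadPoolExecutor instead. This seems to handle the opening and closing of the files to copy better. With this change, all the code stays the same except for the threadWorkerCopy(self, filename_list: List[str]) method, which now looks like this: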

    def threadWorkerCopy(self, filename_list: List[str]):
        """
        This function initializes the workers to enable the multi-threaded process. The workers are handled automatically by
        ThreadPoolExecutor. More information about multi-threading can be found here: https://realpython.com/intro-to-python-threading/.
        A recurrent problem with the threading here was "OSError: [Errno 24] Too many open files". This was coming from the fact
        that daemon threads were not killed before the end of the script. Therefore, everything opened by them was never closed.

        Args:
            filename_list (List[str]): List containing the names of the files to copy.
        """
        with concurrent.futures.ThreadPoolExecutor(max_workers=cores) as executor:
            executor.submit(self.CopyWorker)

            for filename in filename_list:
                self.file_queue.put(filename)
            self.file_queue.join()  # program waits for this process to be done.
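
A minimal sketch of a related variant, not this answer's exact code: each file is submitted to the ThreadPoolExecutor as its own task, so no long-running worker loop or queue is needed, and the with block returns once every copy has finished. The function name copy_all and its parameters are hypothetical.

```python
import concurrent.futures
import shutil


def copy_all(filename_list, dest_path, max_workers=16):
    """Copy every file into dest_path using a thread pool.

    Returns (copied, failed): the names that were copied, and
    (name, exception) pairs for the copies that raised OSError.
    """
    copied, failed = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its file name so failures can be reported.
        futures = {
            executor.submit(shutil.copy, name, dest_path): name
            for name in filename_list
        }
        for future in concurrent.futures.as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                copied.append(name)
            except OSError as exc:
                failed.append((name, exc))
    return copied, failed
```

Because shutil.copy opens and closes both files inside a single call, at most max_workers copies are in flight at any moment, which keeps the number of open file descriptors bounded.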

Answer 4

How about using a ThreadPool?
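
A minimal sketch of what a ThreadPool-based copy could look like, assuming the source paths are already collected in a list. The function name copy_files, its parameters, and the default of 4 threads are hypothetical choices for illustration.

```python
import shutil
from functools import partial
from multiprocessing.pool import ThreadPool


def copy_files(paths, dest_dir, n_threads=4):
    """Copy each path into dest_dir using a pool of n_threads threads.

    Returns the list of destination paths reported by shutil.copy.
    """
    # Fix the destination so the pool only has to supply the source path.
    copy_to_dest = partial(shutil.copy, dst=dest_dir)
    with ThreadPool(n_threads) as pool:
        # map() blocks until every copy is done and re-raises the first error.
        return pool.map(copy_to_dest, paths)
```

The thread count is worth tuning by measurement, since the best value depends on the storage and network rather than on the number of CPU cores.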

Answer 5

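If you just want to copy a directory tree from one path to another, here is my solution, which is a bit simpler than the previous ones. It leverages multiprocessing.pool.ThreadPool and passes a custom copy function to shutil.copytree: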

import shutil
from multiprocessing.pool import ThreadPool


class MultithreadedCopier:
    def __init__(self, max_threads):
        self.pool = ThreadPool(max_threads)

    def copy(self, source, dest):
        self.pool.apply_async(shutil.copy2, args=(source, dest))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.pool.close()
        self.pool.join()


src_dir = "/path/to/src/dir"
dest_dir = "/path/to/dest/dir"


with MultithreadedCopier(max_threads=16) as copier:
    shutil.copytree(src_dir, dest_dir, copy_function=copier.copy)
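
One thing to be aware of: apply_async does not raise a worker's exception unless the returned result is inspected, so a failed copy goes unnoticed in the code above. A hedged sketch of a variant that keeps the results and collects failures follows; the class name CheckedMultithreadedCopier and its errors attribute are hypothetical additions, not part of the original answer.

```python
import shutil
from multiprocessing.pool import ThreadPool


class CheckedMultithreadedCopier:
    """Like MultithreadedCopier, but records copies that failed."""

    def __init__(self, max_threads):
        self.pool = ThreadPool(max_threads)
        self.results = []  # (source, AsyncResult) pairs
        self.errors = []   # (source, exception) pairs, filled on exit

    def copy(self, source, dest):
        # Keep the AsyncResult so the outcome can be checked later.
        result = self.pool.apply_async(shutil.copy2, args=(source, dest))
        self.results.append((source, result))

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.pool.close()
        self.pool.join()
        for source, result in self.results:
            try:
                result.get()  # re-raises the worker's exception, if any
            except OSError as exc:
                self.errors.append((source, exc))
        return False
```

After the with block finishes, copier.errors can be checked; an empty list means every submitted copy completed without raising OSError.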

Python multiprocess/multithreading to speed up file copying
