How to write a large number of small files faster using Python threads

Date: 2019-06-05 17:52:23

Tags: python, file

I want to create roughly 50,000 files with Python. They are very simple files, each under 20 lines.

My first attempt was to add threading to do this, and it took 220 seconds on my 8th-gen i7 machine.

With threading


import threading
import time

def random_files(i):
    # Write one small HTML file with fixed front-matter content.
    # path is assumed to be defined elsewhere.
    filepath = path + "/content/%s.html" % i
    fileobj = open(filepath, "w+")
    l1 = "---\n"
    l2 = 'title: "test"\n'
    l3 = "date: 2019-05-01T18:37:07+05:30\n"
    l4 = "draft: false\n"
    l5 = 'type: "statecity"\n'
    l6 = "---\n"
    data = l1 + l2 + l3 + l4 + l5 + l6
    fileobj.write(data)
    fileobj.close()

if __name__ == "__main__":
    start_time = time.time()
    # Start one thread per file; note the threads are never joined,
    # so the timer stops before all of them have finished.
    for i in range(0, 50000):
        threading.Thread(name='random_files', target=random_files, args=(i,)).start()
    print("--- %s seconds ---" % (time.time() - start_time))

Without threading

The non-threaded route takes 55 seconds to run.

import time

def random_files():
    # Write all 50,000 small HTML files sequentially.
    # path is assumed to be defined elsewhere.
    for i in range(0, 50000):
        filepath = path + "/content/%s.html" % i
        fileobj = open(filepath, "w+")
        l1 = "---\n"
        l2 = 'title: "test"\n'
        l3 = "date: 2019-05-01T18:37:07+05:30\n"
        l4 = "draft: false\n"
        l5 = 'type: "statecity"\n'
        l6 = "---\n"
        data = l1 + l2 + l3 + l4 + l5 + l6
        fileobj.write(data)
        fileobj.close()

if __name__ == "__main__":
    start_time = time.time()
    random_files()
    print("--- %s seconds ---" % (time.time() - start_time))

CPU usage for the Python task is 10%, RAM usage is only 50 MB, and disk usage averages 4.5 MB/s.

Can the speed be improved significantly?

1 Answer:

Answer 0 (score: 1)

Try threading, but split the load evenly across the threads in the system.

This gives a nearly linear speedup in the number of threads the load is distributed across:

Without threading:

~11% CPU, ~5 MB/s disk

--- 69.15089249610901 seconds ---


With threading, 4 threads:

22% CPU, 13 MB/s disk

--- 29.21335482597351 seconds ---


With threading, 8 threads:

27% CPU, 15 MB/s disk

--- 20.8521249294281 seconds ---


For example:

import time
from threading import Thread

def random_files(i):
    # Same file-writing routine as in the question;
    # path is assumed to be defined elsewhere.
    filepath = path + "/content/%s.html" % i
    fileobj = open(filepath, "w+")
    l1 = "---\n"
    l2 = 'title: "test"\n'
    l3 = "date: 2019-05-01T18:37:07+05:30\n"
    l4 = "draft: false\n"
    l5 = 'type: "statecity"\n'
    l6 = "---\n"
    data = l1 + l2 + l3 + l4 + l5 + l6
    fileobj.write(data)
    fileobj.close()

def pool(start, number):
    # Each thread writes one contiguous batch of files.
    for i in range(start, start + number):
        random_files(i)

if __name__ == "__main__":
    start_time = time.time()
    num_files = 50000
    threads = 8
    batch_size = num_files // threads  # integer division; assumes an even split
    thread_list = [Thread(name='random_files', target=pool,
                          args=(batch_size * thread_index, batch_size))
                   for thread_index in range(threads)]
    [t.start() for t in thread_list]
    # join() is required to wait for each of the threads
    # to finish before stopping the timer
    [t.join() for t in thread_list]

    print("--- %s seconds ---" % (time.time() - start_time))

However, the solution given here is only an example to demonstrate the speedup that can be achieved. Splitting the 50,000 files into 8 batches (one per thread) only works because the files divide into batches evenly; for anything else, the pool() function will need a more robust way of splitting the load into batches.
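
As a minimal sketch of what a more robust split could look like (a hypothetical split_batches helper, not part of the original answer), the remainder of an uneven division can be spread across the first few threads:

def split_batches(num_files, threads):
    # Hypothetical helper: yields (start, count) pairs that cover every
    # file, spreading the remainder of an uneven division over the
    # first few threads.
    base, extra = divmod(num_files, threads)
    start = 0
    for thread_index in range(threads):
        count = base + (1 if thread_index < extra else 0)
        yield start, count
        start += count

# Usage with the pool() function above; e.g. for 50,001 files the
# first thread gets one extra file:
# thread_list = [Thread(target=pool, args=batch)
#                for batch in split_batches(50001, 8)]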

Take this SO example as a starting point for distributing an uneven load between threads.
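
Another route (a sketch, not something the original answer uses, assuming the random_files function from above is in scope) is to let the standard library's concurrent.futures.ThreadPoolExecutor distribute the work across a fixed number of threads:

import time
from concurrent.futures import ThreadPoolExecutor

if __name__ == "__main__":
    start_time = time.time()
    # map() hands the 50,000 indices out to 8 worker threads as they
    # become free, so no manual batching is needed; list() consumes the
    # result iterator so any exception in a worker is raised here.
    with ThreadPoolExecutor(max_workers=8) as executor:
        list(executor.map(random_files, range(50000)))
    print("--- %s seconds ---" % (time.time() - start_time))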

Hope this helps!