I want to create about 50,000 files with Python. They are very simple files, each under 20 lines. My first attempt was to add threading to do this, and it took 220 seconds on my 8th-gen i7 machine.
With threading:
def random_files(i):
    filepath = path + "/content/%s.html" % (str(i))
    fileobj = open(filepath, "w+")
    l1 = "---\n"
    l2 = 'title: "test"\n'
    l3 = "date: 2019-05-01T18:37:07+05:30" + "\n"
    l4 = "draft: false" + "\n"
    l5 = 'type: "statecity"' + "\n"
    l6 = "---" + "\n"
    data = l1 + l2 + l3 + l4 + l5 + l6
    fileobj.writelines(data)
    fileobj.close()

if __name__ == "__main__":
    start_time = time.time()
    for i in range(0, 50000):
        i = str(i)
        threading.Thread(name='random_files', target=random_files, args=(i,)).start()
    print("--- %s seconds ---" % (time.time() - start_time))
Without threading:
The non-threaded route takes 55 seconds to run.
def random_files():
    for i in range(0, 50000):
        filepath = path + "/content/%s.html" % (str(i))
        fileobj = open(filepath, "w+")
        l1 = "---\n"
        l2 = 'title: "test"\n'
        l3 = "date: 2019-05-01T18:37:07+05:30" + "\n"
        l4 = "draft: false" + "\n"
        l5 = 'type: "statecity"' + "\n"
        l6 = "---" + "\n"
        data = l1 + l2 + l3 + l4 + l5 + l6
        fileobj.writelines(data)
        fileobj.close()

if __name__ == "__main__":
    start_time = time.time()
    random_files()
    print("--- %s seconds ---" % (time.time() - start_time))
The Python task's CPU usage is 10%, RAM usage is only 50 MB, and disk usage averages 4.5 MB/s.
Can the speed be improved significantly?
Answer 0 (score: 1)
Try threading, but split the load evenly across the threads in the system.
This gives a nearly linear speedup in the number of threads the load is split across:
Without threading:
~11% CPU, ~5 MB/s disk
--- 69.15089249610901 seconds ---
With threading: 4 threads
22% CPU, 13 MB/s disk
--- 29.21335482597351 seconds ---
With threading: 8 threads
27% CPU, 15 MB/s disk
--- 20.8521249294281 seconds ---
For example:
import time
from threading import Thread

def random_files(i):
    filepath = path + "/content/%s.html" % (str(i))
    fileobj = open(filepath, "w+")
    l1 = "---\n"
    l2 = 'title: "test"\n'
    l3 = "date: 2019-05-01T18:37:07+05:30" + "\n"
    l4 = "draft: false" + "\n"
    l5 = 'type: "statecity"' + "\n"
    l6 = "---" + "\n"
    data = l1 + l2 + l3 + l4 + l5 + l6
    fileobj.writelines(data)
    fileobj.close()

def pool(start, number):
    for i in range(start, start + number):
        random_files(i)

if __name__ == "__main__":
    start_time = time.time()
    num_files = 50000
    threads = 8
    batch_size = num_files // threads  # integer division, so range() receives ints
    thread_list = [Thread(name='random_files', target=pool,
                          args=(batch_size * thread_index, batch_size))
                   for thread_index in range(threads)]
    [t.start() for t in thread_list]
    [t.join() for t in thread_list]  # wait for each thread to finish before stopping the timer
    print("--- %s seconds ---" % (time.time() - start_time))
However, the solution presented here is only an example of the speedup that can be achieved. Splitting 50,000 files into 8 batches (one per thread) only works because the file count divides evenly; the pool() function would need a more robust way of dividing the load into batches.
Take this SO example as a starting point for distributing an uneven load across threads.
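For reference, one way to build such batches (my own sketch, not taken from the linked example) is divmod: it splits num_files into contiguous batches whose sizes differ by at most one, and the resulting (start, size) pairs can be fed straight into the pool() function above.

```python
def split_batches(num_files, threads):
    # base size per thread, plus one extra item for the first `extra` threads
    base, extra = divmod(num_files, threads)
    batches = []
    start = 0
    for thread_index in range(threads):
        size = base + (1 if thread_index < extra else 0)
        batches.append((start, size))
        start += size
    return batches

# Uneven case: 10 files over 3 threads -> batch sizes 4, 3, 3
print(split_batches(10, 3))      # [(0, 4), (4, 3), (7, 3)]
# Even case: 50,000 files over 8 threads -> eight batches of 6250
print(split_batches(50000, 8))
```

Every index is covered exactly once, regardless of whether threads divides num_files.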
Hope this helps!