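My program lists and reads all the files in a directory and, at the same time, counts the total number of records in those files.
When I run the code below, I get a list of worker process names with counts arriving in chunks, because the record counts from multiple files are also being produced in parallel.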
import multiprocessing as mp
import time
import os
path = '/home/vaibhav/Desktop/Input_python'

def process_line(f):
    print(mp.current_process())
    #print("process id = " , os.getpid(f))
    print(sum(1 for line in f))

for filename in os.listdir(path):
    print(filename)
    if __name__ == "__main__":
        with open('/home/vaibhav/Desktop/Input_python/'+ filename, "r+") as source_file:
            # chunk the work into batches
            p = mp.Pool()
            results = p.map(process_line, source_file)

start_time = time.time()
print("My program took", time.time() - start_time, "to run")
Current output
<ForkProcess(ForkPoolWorker-54, started daemon)>
73
<ForkProcess(ForkPoolWorker-55, started daemon)>
<ForkProcess(ForkPoolWorker-56, started daemon)>
<ForkProcess(ForkPoolWorker-53, started daemon)>
73
1
<ForkProcess(ForkPoolWorker-53, started daemon)>
79
<ForkProcess(ForkPoolWorker-54, started daemon)>
<ForkProcess(ForkPoolWorker-56, started daemon)>
<ForkProcess(ForkPoolWorker-55, started daemon)>
79
77
77
Is there a way to get the total record count for each file, like this?
File1.Txt Total_Recordcount
...
Filen.txt Total_Recordcount
Update: I found a solution and have pasted the answer in the comments section.
Answer 0: (score: 0)
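Counting lines in a text file should not be CPU-bound, so it is not a good candidate for threading. You may want to use a thread pool to process multiple independent files, but for a single file, here is a way to count lines that should be very fast: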
import pandas as pd
# header=None treats every row as data; usecols=[0] keeps only the first column
data = pd.read_table(source_file, dtype='S1', header=None, usecols=[0])
count = len(data)
What this does is parse the first character (dtype='S1') into a DataFrame and then check its length. The parser is implemented in C, so no slow Python loop is needed. This should give close to the best possible speed, limited only by the disk subsystem.
This sidesteps the original problem entirely, because you now get a single count per file.
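For the multi-file case mentioned above, a thread pool could look roughly like the following. This is only a sketch using the standard library rather than pandas; the helper names `count_lines` and `count_lines_in_files`, the chunk size, and the worker count are all illustrative assumptions, and the demo files are temporary ones created just for the example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_lines(path, chunk_size=1 << 20):
    """Count newline characters in one file, reading it in binary chunks.

    Note: a last line without a trailing newline is not counted.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


def count_lines_in_files(paths, max_workers=4):
    """Return {path: line_count}, handling one whole file per task."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(zip(paths, executor.map(count_lines, paths)))


# Demo with two temporary files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for name, rows in (("a.txt", 3), ("b.txt", 5)):
        p = os.path.join(tmp, name)
        with open(p, "w") as f:
            f.write("row\n" * rows)
        demo_paths.append(p)
    demo_counts = {
        os.path.basename(p): n
        for p, n in count_lines_in_files(demo_paths).items()
    }

print(demo_counts)  # {'a.txt': 3, 'b.txt': 5}
```

Because each task covers one whole file, every result is a complete per-file count instead of a partial one.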
Answer 1: (score: 0)
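Earlier I was reading the files one at a time and spawning multiple processes for each file, which produced record counts for chunks of a file rather than for the whole file.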
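A small sketch of why that happens, using the built-in map and an in-memory file instead of a process pool (the three-line contents are made up for illustration). Pool.map iterates over its iterable in the same way, and iterating a file object yields one line at a time, so each call receives a single line as a string and ends up counting its characters. That would also explain the values 73, 79 and 1 in the output above.

```python
import io


def process_line(f):
    # Same counting logic as the worker in the question.
    return sum(1 for _ in f)


# Hypothetical file contents standing in for a real file on disk.
source_file = io.StringIO("alpha\nbeta\n\n")

# Each call gets ONE line (a str), so the sum counts that line's
# characters, trailing newline included.
per_item = list(map(process_line, source_file))
print(per_item)  # [6, 5, 1]

# Counting lines needs the whole file handed to a single call.
source_file.seek(0)
total_lines = process_line(source_file)
print(total_lines)  # 3
```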
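I have now changed my approach: I pass a list of files as the iterable to pool.map(), which distributes the different files in the list across multiple processes and gives a better run time. Here is the link I used as a reference, and below is the pasted and corrected code.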
import os
import time
from multiprocessing import Pool

folder = '/home/vaibhav/Desktop/Input_python'

def file_wc(fname):
    # Runs in a worker process: count the lines of one whole file
    with open(os.path.join(folder, fname)) as f:
        count = sum(1 for line in f)
    return (fname, count)

if __name__ == "__main__":
    # Start the timer before the work so the elapsed time is meaningful
    start_time = time.time()
    fnames = os.listdir(folder)
    pool = Pool()
    # One task per file, so each result is a complete per-file count
    print(dict(pool.map(file_wc, fnames)))
    pool.close()
    pool.join()
    print("My program took", time.time() - start_time, "to run")
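To print the result in the "File1.Txt Total_Recordcount" layout asked for in the question, the mapping of names to counts can be turned into one line per file. This is a minimal sketch; `format_counts` is a hypothetical helper and the sample counts are invented for illustration.

```python
def format_counts(counts):
    """Render {filename: count} as one 'filename count' line per file."""
    return "\n".join(
        f"{name} {total}" for name, total in sorted(counts.items())
    )


# Made-up counts standing in for dict(pool.map(file_wc, fnames)).
sample_counts = {"File2.txt": 79, "File1.txt": 73}
report = format_counts(sample_counts)
print(report)
```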