I have a function that lazily yields rows from a huge CSV file:
def get_next_line():
    with open(sample_csv, 'r') as f:
        for line in f:
            yield line

def do_long_operation(row):
    print('Do some operation that takes a long time')
I need to use threads so that do_long_operation can be called for every record I get from the function above.
Most examples on the internet look like the following, and I am not sure whether I am on the right track:
import threading

thread_list = []
for i in range(8):
    # pseudocode: unsure how to hand each thread its own row from get_next_line
    t = threading.Thread(target=do_long_operation, args=(get_next_row from get_next_line))
    thread_list.append(t)

for thread in thread_list:
    thread.start()

for thread in thread_list:
    thread.join()
My questions are:
a) How do I start only a limited number of threads, say 8?
b) How do I make sure that each thread gets one row from get_next_line?
Answer 0 (score: 5)
You can use a thread pool from multiprocessing and map your tasks onto a pool of workers:
from multiprocessing.pool import ThreadPool as Pool
# from multiprocessing import Pool
from random import randint
from time import sleep

def process_line(l):
    print(l, "started")
    sleep(randint(0, 3))
    print(l, "done")

def get_next_line():
    with open("sample.csv", 'r') as f:
        for line in f:
            yield line

f = get_next_line()
t = Pool(processes=8)
for line in f:
    # apply_async submits each line without blocking; a free worker picks it up
    t.apply_async(process_line, (line,))
t.close()   # no more tasks will be submitted
t.join()    # wait for all submitted tasks to finish
This creates 8 workers and submits your lines to them one by one. As soon as a worker is free, it is assigned a new task.
There is also a commented-out import statement: if you comment out ThreadPool and instead import Pool from multiprocessing, you get subprocesses instead of threads, which may be more efficient in your case.
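A minimal sketch of that process-based variant, not the answerer's original code: it reuses the "sample.csv" and process_line names from above. With real processes, the pool should be created under an if __name__ == '__main__': guard and the worker function must be defined at module level so it can be pickled.

from multiprocessing import Pool

def process_line(l):
    print(l, "processed")

def get_next_line():
    with open("sample.csv", 'r') as f:
        for line in f:
            yield line

if __name__ == '__main__':
    with Pool(processes=8) as p:
        # imap streams tasks to the workers and yields results as they finish,
        # instead of first building a full argument list the way map would
        for _ in p.imap(process_line, get_next_line()):
            pass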
Answer 1 (score: 2)
Use Pool/ThreadPool from multiprocessing to map tasks to a pool of workers, and use a Queue to control how many tasks are held in memory (so we do not read too far ahead into the huge CSV file if the workers are slow).
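A minimal sketch of that idea, not the answerer's original code: it assumes a ThreadPool of 8 workers, a queue bounded at 100 lines, and hypothetical names process_line and "sample.csv". Because the workers here are threads, a plain queue.Queue is enough; with real processes you would need multiprocessing.Queue.

from multiprocessing.pool import ThreadPool
from queue import Queue

POOL_SIZE = 8
line_queue = Queue(maxsize=100)   # bounds how many lines are held in memory

def process_line(line):
    print(line.strip(), "done")

def worker(_):
    # each worker pulls lines until it sees the None sentinel
    while True:
        line = line_queue.get()
        if line is None:
            break
        process_line(line)

pool = ThreadPool(POOL_SIZE)
workers = pool.map_async(worker, range(POOL_SIZE))

with open("sample.csv", 'r') as f:
    for line in f:
        line_queue.put(line)      # blocks while the queue is full (backpressure)

for _ in range(POOL_SIZE):
    line_queue.put(None)          # one sentinel per worker

workers.wait()
pool.close()
pool.join()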
Answer 2 (score: 0)
import csv
from multiprocessing import Pool
import datetime

def process_row(row):
    row_to_be_printed = str(row) + str("hola!")
    print(row_to_be_printed)

def call_processing_rows_pickably(row):
    process_row(row)

class process_csv():
    def __init__(self, file_name):
        self.file_name = file_name

    def get_row_count(self):
        with open(self.file_name) as f:
            for i, l in enumerate(f):
                pass
        self.row_count = i + 1   # enumerate is zero-based

    def select_chunk_size(self):
        if self.row_count > 10000000:
            self.chunk_size = 100000
            return
        if self.row_count > 5000000:
            self.chunk_size = 50000
            return
        self.chunk_size = 10000

    def process_rows(self):
        list_de_rows = []
        count = 0
        with open(self.file_name, 'r', newline='') as file:   # text mode for csv in Python 3
            reader = csv.reader(file)
            for row in reader:
                count += 1
                print(count)   # progress: rows read so far
                list_de_rows.append(row)
                if len(list_de_rows) == self.chunk_size:
                    p.map(call_processing_rows_pickably, list_de_rows)
                    del list_de_rows[:]
        if list_de_rows:   # process the final, partial chunk
            p.map(call_processing_rows_pickably, list_de_rows)

    def start_process(self):
        self.get_row_count()
        self.select_chunk_size()
        self.process_rows()

if __name__ == '__main__':   # required when Pool spawns worker processes
    initial = datetime.datetime.now()
    p = Pool(4)
    ob = process_csv("100M_primes.csv")
    ob.start_process()
    p.close()
    p.join()
    final = datetime.datetime.now()
    print(final - initial)
This took me 22 minutes, so there is clearly room for improvement; for comparison, fread in R needs at most 10 minutes for this task.
The difference is that I first build a chunk of 100k rows and then pass it to a function which the pool maps over (here, 4 worker processes).
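As a side note, Pool.imap accepts a chunksize argument that batches rows for you, so the manual chunking above can be approximated in a few lines. This is only a sketch under the same assumed file name and a trivial process_row, not the code that produced the 22-minute timing:

import csv
from multiprocessing import Pool

def process_row(row):
    print(str(row) + "hola!")

if __name__ == '__main__':
    with open("100M_primes.csv", 'r', newline='') as f, Pool(4) as p:
        # chunksize groups rows into batches before sending them to the workers
        for _ in p.imap(process_row, csv.reader(f), chunksize=100000):
            pass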