I'm trying to process a file where every line is a JSON document. The files range from a few MBs up to hundreds of MBs in size, so I wrote generator code to fetch each document from the file line by line:
import codecs
import json

def jl_file_iterator(file):
    with codecs.open(file, 'r', 'utf-8') as f:
        for line in f:
            document = json.loads(line)
            yield document
My machine has 4 cores, so I would like to process 4 lines of the file in parallel. Currently I have this code, which takes 4 lines at a time and calls a function to process them in parallel:
threads = 4
files, i = [], 1
for jl in jl_file_iterator(input_path):
    files.append(jl)
    if i % threads == 0:
        # pool.map(processFile, files)
        parallelProcess(files, o)
        files = []
    i += 1

if files:
    parallelProcess(files, o)
    files = []
This is the code where the actual processing happens:
from multiprocessing import Process

def parallelProcess(files, outfile):
    processes = []
    for i in range(len(files)):
        p = Process(target=processFile, args=(files[i],))
        processes.append(p)
        p.start()
    for i in range(len(files)):
        processes[i].join()

def processFile(doc):
    extractors = {}
    ... do some processing on doc
    o.write(json.dumps(doc) + '\n')   # o is the shared output file handle
As you can see, I wait for all 4 lines to finish processing before I send the next 4 documents for processing. But what I want is: as soon as one process finishes processing its document, I want to start assigning the next line to the processor that has just freed up. How do I do that?

PS: The problem is that, because it's a generator, I cannot load all the documents up front and use something like map to run the processes.

Thanks for your help.
Answer 0 (score: 11)
As @pvg said in a comment, a (bounded) queue is the natural way to mediate between a producer and consumers running at different speeds, ensuring they all stay as busy as possible while not letting the producer get way ahead.

Here's a self-contained, executable example. The queue is restricted to a maximum size equal to the number of worker processes. If the consumers run much faster than the producer, it could make good sense to let the queue get bigger than that.

In your specific case, it would probably make sense to pass the lines to the consumers and let them do the document = json.loads(line) part in parallel.
import multiprocessing as mp

NCORE = 4

def process(q, iolock):
    from time import sleep
    while True:
        stuff = q.get()
        if stuff is None:
            break
        with iolock:
            print("processing", stuff)
        sleep(stuff)

if __name__ == '__main__':
    q = mp.Queue(maxsize=NCORE)
    iolock = mp.Lock()
    pool = mp.Pool(NCORE, initializer=process, initargs=(q, iolock))
    for stuff in range(20):
        q.put(stuff)  # blocks until q below its max size
        with iolock:
            print("queued", stuff)
    for _ in range(NCORE):  # tell workers we're done
        q.put(None)
    pool.close()
    pool.join()
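A minimal sketch of how that suggestion could look for the JSON-lines file in the question (not part of the original answer); process_document and the documents.jl path are placeholders, and each worker parses its own lines with json.loads so the parsing also happens in parallel:

import json
import multiprocessing as mp

NCORE = 4
input_path = 'documents.jl'       # hypothetical path to the JSON-lines file

def process_document(doc):
    # placeholder for the real per-document work
    pass

def worker(q):
    while True:
        line = q.get()
        if line is None:          # end-of-input signal
            break
        doc = json.loads(line)    # parsing happens in the worker, in parallel
        process_document(doc)

if __name__ == '__main__':
    q = mp.Queue(maxsize=NCORE)
    pool = mp.Pool(NCORE, initializer=worker, initargs=(q,))
    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            q.put(line)           # blocks once NCORE lines are already waiting
    for _ in range(NCORE):        # one sentinel per worker
        q.put(None)
    pool.close()
    pool.join()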
Answer 1 (score: 6)

So I finally got this running successfully, by creating chunks of lines from my file and running the chunks in parallel. Posting it here so it may be useful to somebody in the future.
import json
import multiprocessing as mp
import os

# The methods below belong to a class that stores the path in self.input_file
# and implements self.process_file(document).

def run_parallel(self, processes=4):
    processes = int(processes)
    pool = mp.Pool(processes)
    try:
        jobs = []
        # run for chunks of the file
        for chunkStart, chunkSize in self.chunkify(self.input_file):
            jobs.append(pool.apply_async(self.process_wrapper, (chunkStart, chunkSize)))
        for job in jobs:
            job.get()
        pool.close()
    except Exception as e:
        print(e)

def process_wrapper(self, chunkStart, chunkSize):
    # open in binary mode so the byte offsets produced by chunkify stay valid
    with open(self.input_file, 'rb') as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        for line in lines:
            document = json.loads(line)
            self.process_file(document)

# Splitting data into chunks for parallel processing
def chunkify(self, filename, size=1024 * 1024):
    fileEnd = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size, 1)
            f.readline()              # advance to the next line boundary
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:
                break
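As a usage sketch (not part of the original answer), the chunking idea can be checked on its own; this assumes a hypothetical documents.jl path and simply confirms that chunk boundaries always fall on line breaks, so no JSON document is split between workers:

import os

def chunkify(filename, size=1024 * 1024):
    # standalone copy of the method above, without the class
    fileEnd = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size, 1)
            f.readline()              # snap the boundary to the end of a line
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:
                break

if __name__ == '__main__':
    path = 'documents.jl'             # hypothetical JSON-lines file
    total_lines = 0
    with open(path, 'rb') as f:
        for start, length in chunkify(path, size=64 * 1024):
            f.seek(start)
            chunk = f.read(length)
            # every chunk ends at a newline (except possibly the last one)
            total_lines += len(chunk.splitlines())
    print('lines seen across all chunks:', total_lines)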
Answer 2 (score: 0)

The answer of Tim Peters is great.
But my specific case was slightly different, so I had to modify his answer to fit my needs. Posting it here for reference.

This also answers @CpILL's question in the comments.
In my case, I used a chain of generators (to build a pipeline). One of the generators in this chain was doing heavy computations, slowing down the whole pipeline. Something like this:
def fast_generator1():
    for line in file:
        yield line

def slow_generator(lines):
    for line in lines:
        yield heavy_processing(line)

def fast_generator2(lines):
    for line in lines:
        yield fast_func(line)

if __name__ == "__main__":
    lines = fast_generator1()
    lines = slow_generator(lines)
    lines = fast_generator2(lines)
    for line in lines:
        print(line)
To make it faster, we have to run the slow generator with multiple processes. The modified code looks like this:
import multiprocessing as mp

NCORE = 4

def fast_generator1():
    for line in file:
        yield line

def slow_generator(lines):
    def gen_to_queue(input_q, lines):
        # This function simply consumes our generator and writes it to the input queue
        for line in lines:
            input_q.put(line)
        for _ in range(NCORE):    # Once generator is consumed, send end-signal
            input_q.put(None)

    def process(input_q, output_q):
        while True:
            line = input_q.get()
            if line is None:
                output_q.put(None)
                break
            output_q.put(heavy_processing(line))

    input_q = mp.Queue(maxsize=NCORE * 2)
    output_q = mp.Queue(maxsize=NCORE * 2)

    # Here we need 3 groups of workers:
    # * One that will consume the input generator and put it into a queue. It will be `gen_pool`. It's ok to have only 1 process doing this, since this is a very light task
    # * One that does the main processing. It will be `pool`.
    # * One that reads the results and yields them back, to keep it a generator. The main thread will do this.
    gen_pool = mp.Pool(1, initializer=gen_to_queue, initargs=(input_q, lines))
    pool = mp.Pool(NCORE, initializer=process, initargs=(input_q, output_q))

    finished_workers = 0
    while True:
        line = output_q.get()
        if line is None:
            finished_workers += 1
            if finished_workers == NCORE:
                break
        else:
            yield line

def fast_generator2(lines):
    for line in lines:
        yield fast_func(line)

if __name__ == "__main__":
    lines = fast_generator1()
    lines = slow_generator(lines)
    lines = fast_generator2(lines)
    for line in lines:
        print(line)
With this implementation, we have a multiprocess generator: it is used exactly like any other generator (as in the first example of this answer), but all the heavy computation is done using multiprocessing, speeding it up!
Answer 3 (score: 0)

Late to the party. Had a similar problem: producers and consumers, basically. Like a few people have pointed out, a queue is best suited to this problem.

You can create an executor pool (threads or processes) and combine it with a semaphore to ensure that only n tasks are picked up at a time. If your generator submits any further task, it blocks until the semaphore counter decreases.

Found a ready-made solution. Check out this Gist.
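A minimal sketch of that executor-plus-semaphore idea (not the linked Gist), assuming a hypothetical handle_doc function and documents.jl path; the semaphore caps how many submitted-but-unfinished tasks exist, so the generator is consumed only as fast as the pool frees up:

import json
import threading
from concurrent.futures import ProcessPoolExecutor

MAX_IN_FLIGHT = 4
input_path = 'documents.jl'       # hypothetical JSON-lines file

def handle_doc(doc):
    # placeholder for the real per-document processing
    return doc

def jl_file_iterator(path):
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

if __name__ == '__main__':
    sem = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def release(_future):
        sem.release()             # free a slot as soon as one task finishes

    with ProcessPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        futures = []
        for doc in jl_file_iterator(input_path):
            sem.acquire()         # blocks while MAX_IN_FLIGHT tasks are pending
            fut = pool.submit(handle_doc, doc)
            fut.add_done_callback(release)
            futures.append(fut)
        for fut in futures:
            fut.result()          # surface any worker exceptions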