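I'm trying to learn multiprocessing with Python. I wrote a simple program that should hand each process 1000 lines from a txt input file. My main function reads a line, splits it, and then performs some very simple operations on the elements of the string. The final results should be written to an output file.
When I run it, 4 processes are spawned correctly, but only one of them is actually running, and with minimal CPU usage. As a result the code is very slow, which defeats the purpose of using multiprocessing in the first place. I don't think I have a global list problem like in this question (python multiprocessing apply_async only uses one process), and I don't think my function is as trivial as in this case (Python multiprocessing.Pool() doesn't use 100% of each CPU).
I can't figure out what I'm doing wrong; any help or suggestions are appreciated. Here is the basic code: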
import multiprocessing
import itertools

def myfunction(line):
    returnlist=[]
    list_of_elem=line.split(",")
    elem_id=list_of_elem[1]
    elem_to_check=list_of_elem[5]
    ids=list_of_elem[2].split("|")
    for x in itertools.permutations(ids,2):
        if x[1] == elem_to_check:
            returnlist.append(",".join([elem_id,x,"1\n"]))
        else:
            returnlist.append(",".join([elem_id,x,"0\n"]))
    return returnlist

def grouper(n, iterable, padvalue=None):
    return itertools.izip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

if __name__ == '__main__':
    my_data = open(r"my_input_file_to_be_processed.txt","r")
    my_data = my_data.read().split("\n")
    p = multiprocessing.Pool(4)
    for chunk in grouper(1000, my_data):
        results = p.map(myfunction, chunk)
        for r in results:
            with open (r"my_output_file","ab") as outfile:
                outfile.write(r)
Edit: I modified my code as suggested (removing the redundant data preprocessing), but the problem still seems to persist.
import multiprocessing
import itertools

def myfunction(line):
    returnlist=[]
    list_of_elem=line.split(",")
    elem_id=list_of_elem[1]
    elem_to_check=list_of_elem[5]
    ids=list_of_elem[2].split("|")
    for x in itertools.permutations(ids,2):
        if x[1] == elem_to_check:
            returnlist.append(",".join([elem_id,x,"1\n"]))
        else:
            returnlist.append(",".join([elem_id,x,"0\n"]))
    return returnlist

if __name__ == '__main__':
    my_data = open(r"my_input_file_to_be_processed.txt","r")
    p = multiprocessing.Pool(4)
    results = p.map(myfunction, chunk, chunksize=1000)
    for r in results:
        with open (r"my_output_file","ab") as outfile:
            outfile.write(r)
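Answer 0 (score: 0)
Based on your code snippet, I think I would do something like this: split the file into 8 parts and let 4 workers do the computation. (Why 8 chunks and 4 workers? It's just an arbitrary choice I made for this example.)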
from multiprocessing import Pool
import itertools

def myfunction(lines):
    # Each worker receives a whole chunk (a tuple of lines), not a single line.
    returnlist = []
    for line in lines:
        list_of_elem = line.split(",")
        elem_id = list_of_elem[1]
        elem_to_check = list_of_elem[5]
        ids = list_of_elem[2].split("|")
        for x in itertools.permutations(ids,2):
            returnlist.append(",".join(
                [elem_id,x,"1\n" if x[1] == elem_to_check else "0\n"]))
    return returnlist

def chunk(it, size):
    # Produce successive tuples of at most `size` items until the input runs out.
    it = iter(it)
    return iter(lambda: tuple(itertools.islice(it, size)), ())

if __name__ == "__main__":
    my_data = open(r"my_input_file_to_be_processed.txt","r")
    my_data = my_data.read().split("\n")
    # Split the lines into roughly 8 chunks.
    prep = [strings for strings in chunk(my_data, round(len(my_data) / 8))]
    with Pool(4) as p:
        res = p.map(myfunction, prep)
    # Merge the per-chunk lists into one flat list.
    result = res.pop(0)
    _ = list(map(lambda x: result.extend(x), res))
    print(result)  # ... or do something with the result
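To make the chunking and merging steps easier to follow, here is a small standalone sketch of what the `chunk` helper produces and of one alternative way to merge the per-chunk results. The `flatten` helper and the sample data are made up for illustration and are not part of the code above. Note that `round(len(my_data) / 8)` does not always give exactly 8 chunks, and for very short inputs it can round down to 0, in which case `chunk` yields nothing at all.

```python
import itertools


def chunk(it, size):
    """Produce successive tuples of at most `size` items from `it`."""
    it = iter(it)
    # iter(callable, sentinel) calls the lambda until it returns the empty tuple.
    return iter(lambda: tuple(itertools.islice(it, size)), ())


def flatten(list_of_lists):
    """Hypothetical helper: concatenate per-chunk result lists into one list."""
    return list(itertools.chain.from_iterable(list_of_lists))


# Made-up stand-in for the lines read from the input file.
sample_lines = ["line%d" % i for i in range(10)]

pieces = list(chunk(sample_lines, 4))
print([len(p) for p in pieces])          # [4, 4, 2]
print(flatten([["a", "b"], ["c"], []]))  # ['a', 'b', 'c']
print(list(chunk(sample_lines, 0)))      # [] (a chunk size of 0 yields nothing)
```

`itertools.chain.from_iterable` gives the same flat list as the `pop`/`extend` approach, without modifying the list returned by `p.map`.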
Edit: This assumes you are sure that all lines are formatted the same way and will not cause errors.
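One way to check that assumption before handing anything to the pool is a quick scan of the input. This is only a sketch: the helper name `find_malformed` and the sample lines are made up, and the minimum of 6 fields is inferred from the highest index (`list_of_elem[5]`) used in `myfunction`.

```python
def find_malformed(lines, min_fields=6):
    """Return (line_number, line) pairs with fewer than `min_fields` comma-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if len(line.split(",")) < min_fields:
            bad.append((number, line))
    return bad


# Made-up sample: one well-formed line, one short line, one empty line.
sample = ["x,id1,a|b,f3,f4,b", "too,short", ""]
print(find_malformed(sample))  # [(2, 'too,short'), (3, '')]

# Splitting file contents on "\n" leaves a trailing empty string when the
# file ends with a newline, so such an empty line is easy to get by accident.
print("a\nb\n".split("\n"))  # ['a', 'b', '']
```

An empty line would make `list_of_elem[1]` raise an `IndexError`, so it is worth filtering those out or handling them explicitly.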
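Based on your comment, it may be useful to look for problems in the function or in the file contents, either by testing it without multiprocessing, or by using try/except in a very broad (and ugly) way to make sure that some output is always produced, whether an exception or a result: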
def myfunction(lines):
    returnlist = []
    for line in lines:
        try:
            list_of_elem = line.split(",")
            elem_id = list_of_elem[1]
            elem_to_check = list_of_elem[5]
            ids = list_of_elem[2].split("|")
            for x in itertools.permutations(ids,2):
                returnlist.append(",".join(
                    [elem_id,x,"1\n" if x[1] == elem_to_check else "0\n"]))
        except Exception as err:
            returnlist.append('I encountered error {} on line {}'.format(err, line))
    return returnlist
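As an illustration of the "test it without multiprocessing" idea, the sketch below runs the per-line logic serially on one made-up line. It shows one problem this kind of test would surface: `x` from `itertools.permutations` is a tuple, and `",".join(...)` raises a `TypeError` when any item is not a string. The function name `process_line`, the sample line, and the output layout (id, first element, second element, flag) are assumptions for the example, since the intended output format isn't stated.

```python
import itertools


def process_line(line):
    """Hypothetical single-line variant of myfunction, run without a pool."""
    fields = line.split(",")
    elem_id = fields[1]
    elem_to_check = fields[5]
    ids = fields[2].split("|")
    rows = []
    for pair in itertools.permutations(ids, 2):
        flag = "1" if pair[1] == elem_to_check else "0"
        # Assumed layout: unpack the pair into two separate string fields.
        rows.append(",".join([elem_id, pair[0], pair[1], flag]) + "\n")
    return rows


# Joining a list that contains a tuple fails, as in the original join call.
try:
    ",".join(["id1", ("a", "b"), "1\n"])
    join_failed = False
except TypeError as err:
    join_failed = True
    print("join failed:", err)

# Made-up input line with six comma-separated fields.
rows = process_line("x,id1,a|b|c,f3,f4,b")
print(len(rows))  # 6
print(rows[0])    # id1,a,b,1
```

Once the function behaves on a handful of lines like this, it can be passed to `Pool.map` with more confidence that slow or silent runs are not hiding an exception.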