I'm working with a huge CSV file (over 15 GB). I'm using fuzzy matching to extract rows, but when I check the resource monitor the script only seems to use one core, and processing takes a very long time. Here is an example of the current script:
import csv
import pandas as pd
from fuzzywuzzy import fuzz  # assuming fuzzywuzzy, which provides token_set_ratio

with open('output.txt', 'w', newline='', encoding='utf8') as fw:
    writer = csv.writer(fw, delimiter=',', lineterminator='\n')
    for chunk in pd.read_csv("inputfile.csv", chunksize=10000, sep=','):
        for index, row in chunk.iterrows():
            if fuzz.token_set_ratio("search_terms", "{0}".format(row['item_name'])) > 90 and row['brand'] != 'example_brand':
                print(row['item_name'], row['brand'])  # just for visual confirmation, since the script runs for hours and hours
                line = (row['id'], row['brand'], row['item_name'])
                writer.writerow(line)
I'd like to set this up to use multiprocessing.Pool to distribute the chunks across multiple processes, but I'm fairly new to Python and haven't had any luck working from examples to get it running. The script below grabs all 4 CPU cores and seems to spawn a bunch of processes, then immediately terminates them without doing anything, as far as I can tell. Does anyone know why it behaves this way and how to get it working properly?
import multiprocessing as mp

def fuzzcheck(chunk):
    for index, row in chunk.iterrows():
        if fuzz.token_set_ratio("search_terms", "{0}".format(row['item_name'])) > 90 and row['brand'] != "example_brand":
            print(row['item_name'], row['brand'])
            line = (row['ID'], row['brand'], row['item_name'])
            writer.writerow(line)

with mp.Pool(4) as pool, open('output.txt', 'w', newline='', encoding='utf8') as fw:
    writer = csv.writer(fw, delimiter=',', lineterminator='\n')
    for chunk in pd.read_csv("inputfile.csv", chunksize=10000, sep=','):
        pool.apply(fuzzcheck, chunk)
Answer 0 (score: 0)
The answer is covered here: No multiprocessing print outputs (Spyder)

It turns out that Spyder doesn't run multiprocessing programs unless they are launched in a new window.
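For reference, once the script is launched outside Spyder (e.g. with plain python from a terminal), a minimal sketch of the pool wiring could look like the following. This is an untested sketch, not the poster's exact fix: it assumes the fuzzywuzzy library for token_set_ratio, guards the pool setup with if __name__ == '__main__': so that freshly spawned worker processes don't re-execute it, and has each worker return its matches so that only the parent process writes to the output file (the workers never see the parent's writer object).

import csv
import multiprocessing as mp
import pandas as pd
from fuzzywuzzy import fuzz  # assumed fuzzy-matching library

def fuzzcheck(chunk):
    # Collect matching rows and return them instead of writing from the worker.
    matches = []
    for index, row in chunk.iterrows():
        if fuzz.token_set_ratio("search_terms", "{0}".format(row['item_name'])) > 90 and row['brand'] != "example_brand":
            matches.append((row['id'], row['brand'], row['item_name']))
    return matches

if __name__ == '__main__':  # prevents spawned workers from re-running the pool setup
    with mp.Pool(4) as pool, open('output.txt', 'w', newline='', encoding='utf8') as fw:
        writer = csv.writer(fw, delimiter=',', lineterminator='\n')
        chunks = pd.read_csv("inputfile.csv", chunksize=10000, sep=',')
        for matches in pool.imap(fuzzcheck, chunks):  # imap hands chunks to workers as they free up
            writer.writerows(matches)

Unlike pool.apply, which blocks until each call finishes, pool.imap keeps all workers busy on successive chunks while the parent writes out results in order.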