Multiprocessing pandas dataframe chunks

Time: 2018-07-26 20:18:49

Tags: python pandas multiprocessing bigdata

I'm working through a very large CSV file (over 15 GB). I'm using fuzzy matching to extract rows, but when I check the resource monitor the script appears to use only one core, and processing takes a very long time. Here's an example of the current script:

import csv

import pandas as pd
from fuzzywuzzy import fuzz  # fuzz.token_set_ratio comes from the fuzzywuzzy package

with open('output.txt', 'w', newline='', encoding='utf8') as fw:
    writer = csv.writer(fw, delimiter=',', lineterminator='\n')
    for chunk in pd.read_csv("inputfile.csv", chunksize=10000, sep=','):
        for index, row in chunk.iterrows():
            if (fuzz.token_set_ratio("search_terms", "{0}".format(row['item_name'])) > 90
                    and row['brand'] != 'example_brand'):
                # just for visual confirmation, since the script runs for hours and hours
                print(row['item_name'], row['brand'])
                line = (row['id'], row['brand'], row['item_name'])
                writer.writerow(line)

I'd like to set this up so that a multiprocessing.Pool distributes the chunks across several processes, but I'm fairly new to Python and haven't had any luck following the examples and getting it to work. The script below grabs all 4 CPU cores and appears to spawn a bunch of processes, then immediately kills them without doing anything, as far as I can tell. Does anyone know why it behaves this way and how to get it working properly?

import multiprocessing as mp

def fuzzcheck(chunk):
    for index, row in chunk.iterrows():
        if (fuzz.token_set_ratio("search_terms", "{0}".format(row['item_name'])) > 90
                and row['brand'] != "example_brand"):
            print(row['item_name'], row['brand'])
            line = (row['id'], row['brand'], row['item_name'])
            writer.writerow(line)

with mp.Pool(4) as pool, open('output.txt', 'w', newline='', encoding='utf8') as fw:
    writer = csv.writer(fw, delimiter=',', lineterminator='\n')
    for chunk in pd.read_csv("inputfile.csv", chunksize=10000, sep=','):
        pool.apply(fuzzcheck, chunk)

1 Answer:

Answer 0 (score: 0)

The answer is covered here: No multiprocessing print outputs (Spyder)

It turns out that Spyder won't run multiprocessing programs unless they are launched in a new window.
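
For completeness, here is a minimal sketch of one way the chunked filtering could actually run in parallel once the script is launched outside Spyder's console. Note that Pool.apply also blocks until each call returns and expects its arguments as a tuple, so the original loop would never have run chunks concurrently even with printing fixed. The worker-returns-matches layout, imap call, and __main__ guard below are assumptions rather than the asker's final code; the column names, filenames, and >90 threshold are taken from the question.

import csv
import multiprocessing as mp

import pandas as pd
from fuzzywuzzy import fuzz  # assumed source of fuzz, as in the question

def fuzzcheck(chunk):
    # Return the matching rows instead of writing them here, so that only
    # the parent process ever touches the output file and its csv writer.
    matches = []
    for index, row in chunk.iterrows():
        if (fuzz.token_set_ratio("search_terms", str(row['item_name'])) > 90
                and row['brand'] != "example_brand"):
            matches.append((row['id'], row['brand'], row['item_name']))
    return matches

if __name__ == '__main__':  # lets worker processes import this module safely
    with mp.Pool(4) as pool, open('output.txt', 'w', newline='', encoding='utf8') as fw:
        writer = csv.writer(fw, delimiter=',', lineterminator='\n')
        chunks = pd.read_csv("inputfile.csv", chunksize=10000, sep=',')
        # imap hands each chunk to the next idle worker, keeping all four
        # busy; apply() runs one blocking call at a time, so nothing ran
        # in parallel in the original attempt.
        for matches in pool.imap(fuzzcheck, chunks):
            writer.writerows(matches)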