Parallelizing a loop over a pandas groupby and appending to dictionary entries

Time: 2020-10-29 00:58:10

Tags: python pandas multithreading parallel-processing joblib

How can I parallelize this loop in Python?

import pandas as pd

def my_func(tup):
    return {tup[0][1]: tup[1]['col3'].sum()}

arr = [['a','c',3],
        ['b','d',5],
        ['b','d',6],
        ['a','b',1],
        ['a','c',2],
        ['a','b',4]]

df = pd.DataFrame(arr, columns=['col1', 'col2', 'col3'])

return_dict = {}
for i in df.col1.unique():
    return_dict[i] = []

## Need to parallelize this loop
for group in df.groupby(['col1', 'col2']):
    return_dict[group[0][0]].append(my_func(group))      #group[0][0] == unique values in col1

print(return_dict)

Expected output: {'a': [{'b': 5}, {'c': 5}], 'b': [{'d': 11}]}

I tried this, but it does not cover the group[0][0] issue, i.e. the dictionary key is not the return value of the parallelized function.

I tried the following, where I iterate over the col1 values sequentially.

import pandas as pd
from joblib import Parallel, delayed

def my_func(tup):
    return {tup[0]: tup[1]['col3'].sum()}

arr = [['a','c',3],
        ['b','d',5],
        ['b','d',6],
        ['a','b',1],
        ['a','c',2],
        ['a','b',4]]

df = pd.DataFrame(arr, columns=['col1', 'col2', 'col3'])

return_dict = {}
for i in df.col1.unique():
    return_dict[i] = Parallel(n_jobs=-1, backend="threading")(
        map(delayed(my_func), df[df['col1']==i].groupby('col2'))
    )

print(return_dict)

Is there any way to avoid iterating over col1 sequentially? If not, why not?
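
One possible direction, as a minimal sketch rather than a confirmed answer: have the parallel function return the col1 value together with its per-group dictionary, run a single Parallel call over all (col1, col2) groups, and then merge the results in the main thread. The helper name my_func_with_key is hypothetical, and the int() conversion is an added assumption so that the printed values are plain Python integers.

```python
from collections import defaultdict

import pandas as pd
from joblib import Parallel, delayed


def my_func_with_key(key, group):
    # Hypothetical helper: key is the (col1, col2) tuple yielded by groupby.
    # Returning col1 alongside the result tells the caller which
    # dictionary entry the result belongs to.
    col1_value, col2_value = key
    # int() is an assumption added here so the values are plain Python ints.
    return col1_value, {col2_value: int(group['col3'].sum())}


arr = [['a', 'c', 3],
       ['b', 'd', 5],
       ['b', 'd', 6],
       ['a', 'b', 1],
       ['a', 'c', 2],
       ['a', 'b', 4]]

df = pd.DataFrame(arr, columns=['col1', 'col2', 'col3'])

# One parallel pass over every (col1, col2) group; no outer loop over col1.
results = Parallel(n_jobs=-1, backend="threading")(
    delayed(my_func_with_key)(key, group)
    for key, group in df.groupby(['col1', 'col2'])
)

# Sequential merge in the main thread; this step only appends to lists.
grouped = defaultdict(list)
for col1_value, entry in results:
    grouped[col1_value].append(entry)
return_dict = dict(grouped)

print(return_dict)
```

Only the final merge stays sequential, and it does nothing more than append to lists. Whether this is actually faster than the plain loop depends on how expensive the per-group function is; for groups as small as these, the dispatch overhead likely dominates.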

0 answers:

No answers yet.