Parallelizing pandas with multiple class instances

Asked: 2018-06-21 07:25:18

Tags: python python-3.x pandas multiprocessing

I am trying to figure out how to run a large problem on multiple cores. I am struggling with splitting a dataframe up into different processes.

My class looks like this:

class Pergroup():
    def __init__(self, groupid):
        ...

    def process_datapoint(self, df_in, group):
        ...

My data is a time series and contains events that can be grouped using the groupid column. I create an instance of the class for each group:

for groupname in df_in['groupid'].unique():
    instance_names.append(groupname)

holder = {name: Pergroup(name) for name in instance_names}

Now, for each timestamp in the dataframe, I want to call the corresponding instance (based on the group) and pass it the data at that timestamp.

I have tried the following, but it does not seem to do what I expect:

for val in range(0, len(df_in)):
    current_group = df_in['groupid'][val]
    current_df = df_in.ix[val]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(holder[current_group].process_datapoint, current_df, current_group)

I have also tried this, splitting the df into its columns when calling the instances:

Parallel(n_jobs=-1)(map(delayed(holder[current_group].process_datapoint), current_df, current_group))

How should I split the dataframe so that I can still call the right instance with the right data? Basically, I am trying to run a loop like the one below, with the last line running in parallel:

for val in range(0, len(df_in)):
    current_group = df_in['groupid'][val]
    current_df = df_in.ix[val]
    holder[current_group].process_datapoint(current_df, current_group) #This call should be initiated in as many cores as possible.
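For reference, here is a minimal runnable version of that loop (serial, with `.iloc` in place of the now-deprecated `.ix`; the `value` column and the toy `process_datapoint` body, which just records its inputs, are made up for illustration):

```python
import pandas as pd

class Pergroup():
    def __init__(self, groupid):
        self.groupid = groupid
        self.seen = []

    def process_datapoint(self, df_in, group):
        # toy body: record which group/value pairs this instance received
        self.seen.append((group, int(df_in['value'])))

df_in = pd.DataFrame({'groupid': ['a', 'b', 'a'], 'value': [10, 20, 30]})
holder = {name: Pergroup(name) for name in df_in['groupid'].unique()}

for val in range(0, len(df_in)):
    current_group = df_in['groupid'].iloc[val]
    current_df = df_in.iloc[val]  # one row, as a Series
    holder[current_group].process_datapoint(current_df, current_group)

print(holder['a'].seen)  # [('a', 10), ('a', 30)]
```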

2 Answers:

Answer 0 (score: 0):

A slightly different approach, using pool.
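This answer gives no code; a minimal sketch of what a `multiprocessing.Pool` version might look like (the `worker` function and the per-group split via `groupby` are assumptions, not part of the original answer):

```python
import multiprocessing
import pandas as pd

def worker(args):
    # args is one (group name, sub-dataframe) pair from groupby;
    # stand-in for holder[group].process_datapoint(df_group, group)
    group, df_group = args
    return group, len(df_group)

if __name__ == '__main__':
    df_in = pd.DataFrame({'groupid': ['x', 'y', 'x', 'x']})
    with multiprocessing.Pool() as pool:
        # each worker gets one whole group, not one row
        results = dict(pool.map(worker, df_in.groupby('groupid')))
    print(results)  # {'x': 3, 'y': 1}
```

Dispatching whole groups (rather than single rows, as in the question's loop) keeps the inter-process communication overhead proportional to the number of groups instead of the number of rows.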

Answer 1 (score: 0):

At some point I ran into a similar problem; since I cannot fully adapt my solution to your question, I hope you can transpose it and make it fit your problem:

    import math
    import multiprocessing
    import pandas as pd
    from joblib import Parallel, delayed

    maxbatchsize = 10000  # limit the amount of data dispatched to each core
    ncores = -1           # number of cores to use (-1 = all cores)
    data = pd.DataFrame() # <<<- your dataframe

    class DFconvoluter():
        def __init__(self, myparam):
            self.myparam = myparam
        def __call__(self, df):
            return df.apply(lambda row: row['somecolumn'] * self.myparam, axis=1)

    nbatches = max(math.ceil(len(data) / maxbatchsize), ncores)
    # a vector telling which row should be dispatched to which batch
    g = GenStrategicGroups(data['Key'].values, nbatches)

    #-- parallel part
    def applyParallel(dfGrouped, func):
        retLst = Parallel(n_jobs=ncores)(delayed(func)(group) for _, group in dfGrouped)
        return pd.concat(retLst)

    out = applyParallel(data.groupby(g), DFconvoluter(42))

What remains is to write how you want to group the batches together; for me this had to be done in such a way that rows whose values in the 'Key' column are similar stay in the same batch:

def GenStrategicGroups(stratify, ngroups):
    ''' Generate a list of integers in a grouped sequence,
    where grouped levels in stratify are preserved.
    '''
    g = []
    nelpg = float(len(stratify)) / ngroups

    prev_ = None
    grouped_idx = 0
    for i, s in enumerate(stratify):
        if i > (grouped_idx + 1) * nelpg:
            if s != prev_:
                grouped_idx += 1
        g.append(grouped_idx)
        prev_ = s
    return g
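To make the batching concrete, here is the function above exercised on a small key vector (copied verbatim so the snippet runs standalone): the batch label only advances once the size threshold is passed *and* the key changes, so rows sharing a key never straddle two batches.

```python
def GenStrategicGroups(stratify, ngroups):
    # same function as above, reproduced so this snippet is self-contained
    g = []
    nelpg = float(len(stratify)) / ngroups
    prev_ = None
    grouped_idx = 0
    for i, s in enumerate(stratify):
        if i > (grouped_idx + 1) * nelpg:
            if s != prev_:
                grouped_idx += 1
        g.append(grouped_idx)
        prev_ = s
    return g

keys = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']
# 8 rows into 2 batches: the split waits for the b->c key change,
# so the 'b' rows stay whole in batch 0
print(GenStrategicGroups(keys, 2))  # [0, 0, 0, 0, 0, 1, 1, 1]
```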