Parallelizing pandas with multiple class instances

Asked: 2018-06-21 07:25:18

Tags: python python-3.x pandas multiprocessing

I am trying to figure out how to run a large problem on multiple cores. I am struggling with splitting a dataframe up into different processes.

My class looks like this:

class Pergroup():
    def __init__(self, groupid):
        ...

    def process_datapoint(self, df_in, group):
        ...

My data is a time series and contains events that can be grouped using the groupid column. I create an instance of the class for each group:

for groupname in df_in['groupid'].unique():
    instance_names.append(groupname)

holder = {name: Pergroup(name) for name in instance_names}

Now, for each timestamp in the dataframe, I want to call the corresponding instance (based on the group) and pass it the data at that timestamp.

I have tried the following, but it does not seem to do what I expect:

for val in range(0, len(df_in)):
    current_group = df_in['groupid'][val]
    current_df = df_in.ix[val]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(holder[current_group].process_datapoint, current_df, current_group)

I have also tried this, splitting the df into its columns when calling the instances:

Parallel(n_jobs=-1)(map(delayed(holder[current_group].process_datapoint), current_df, current_group))

How should I split the dataframe so that I can still call the right instance with the right data? Basically, I am trying to run a loop like the one below, with the last line running in parallel:

for val in range(0, len(df_in)):
    current_group = df_in['groupid'][val]
    current_df = df_in.ix[val]
    holder[current_group].process_datapoint(current_df, current_group) #This call should be initiated in as many cores as possible.
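For reference, here is a minimal runnable version of that loop (serial, with `.iloc` in place of the now-deprecated `.ix`; the `value` column and the toy `process_datapoint` body, which just records its inputs, are made up for illustration):

```python
import pandas as pd

class Pergroup():
    def __init__(self, groupid):
        self.groupid = groupid
        self.seen = []

    def process_datapoint(self, df_in, group):
        # toy body: record which group/value pairs this instance received
        self.seen.append((group, int(df_in['value'])))

df_in = pd.DataFrame({'groupid': ['a', 'b', 'a'], 'value': [10, 20, 30]})
holder = {name: Pergroup(name) for name in df_in['groupid'].unique()}

for val in range(0, len(df_in)):
    current_group = df_in['groupid'].iloc[val]
    current_df = df_in.iloc[val]  # one row, as a Series
    holder[current_group].process_datapoint(current_df, current_group)

print(holder['a'].seen)  # [('a', 10), ('a', 30)]
```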

2 Answers:

Answer 0 (score: 0):

A slightly different approach, using pool.
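This answer gives no code; a minimal sketch of what a `multiprocessing.Pool` version might look like (the `worker` function and the per-group split via `groupby` are assumptions, not part of the original answer):

```python
import multiprocessing
import pandas as pd

def worker(args):
    # args is one (group name, sub-dataframe) pair from groupby;
    # stand-in for holder[group].process_datapoint(df_group, group)
    group, df_group = args
    return group, len(df_group)

if __name__ == '__main__':
    df_in = pd.DataFrame({'groupid': ['x', 'y', 'x', 'x']})
    with multiprocessing.Pool() as pool:
        # each worker gets one whole group, not one row
        results = dict(pool.map(worker, df_in.groupby('groupid')))
    print(results)  # {'x': 3, 'y': 1}
```

Dispatching whole groups (rather than single rows, as in the question's loop) keeps the inter-process communication overhead proportional to the number of groups instead of the number of rows.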

Answer 1 (score: 0):

At some point I ran into a similar problem; since I cannot fully adapt my solution to your question, I hope you can transpose it and make it fit your problem:

    import math
    import multiprocessing
    import pandas as pd
    from joblib import Parallel, delayed

    maxbatchsize = 10000  # limit the amount of data dispatched to each core
    ncores = -1           # number of cores to use (-1 = all cores)
    data = pd.DataFrame() # <<<- your dataframe

    class DFconvoluter():
        def __init__(self, myparam):
            self.myparam = myparam
        def __call__(self, df):
            return df.apply(lambda row: row['somecolumn'] * self.myparam, axis=1)

    nbatches = max(math.ceil(len(data) / maxbatchsize), ncores)
    # a vector telling which row should be dispatched to which batch
    g = GenStrategicGroups(data['Key'].values, nbatches)

    #-- parallel part
    def applyParallel(dfGrouped, func):
        retLst = Parallel(n_jobs=ncores)(delayed(func)(group) for _, group in dfGrouped)
        return pd.concat(retLst)

    out = applyParallel(data.groupby(g), DFconvoluter(42))

What remains is to write how you want to group the batches together; for me this had to be done in such a way that rows whose values in the 'Key' column are similar stay in the same batch:

def GenStrategicGroups(stratify, ngroups):
    ''' Generate a list of integers in a grouped sequence,
    where grouped levels in stratify are preserved.
    '''
    g = []
    nelpg = float(len(stratify)) / ngroups

    prev_ = None
    grouped_idx = 0
    for i, s in enumerate(stratify):
        if i > (grouped_idx + 1) * nelpg:
            if s != prev_:
                grouped_idx += 1
        g.append(grouped_idx)
        prev_ = s
    return g
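To make the batching concrete, here is the function above exercised on a small key vector (copied verbatim so the snippet runs standalone): the batch label only advances once the size threshold is passed *and* the key changes, so rows sharing a key never straddle two batches.

```python
def GenStrategicGroups(stratify, ngroups):
    # same function as above, reproduced so this snippet is self-contained
    g = []
    nelpg = float(len(stratify)) / ngroups
    prev_ = None
    grouped_idx = 0
    for i, s in enumerate(stratify):
        if i > (grouped_idx + 1) * nelpg:
            if s != prev_:
                grouped_idx += 1
        g.append(grouped_idx)
        prev_ = s
    return g

keys = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']
# 8 rows into 2 batches: the split waits for the b->c key change,
# so the 'b' rows stay whole in batch 0
print(GenStrategicGroups(keys, 2))  # [0, 0, 0, 0, 0, 1, 1, 1]
```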