Question

I have df_fruits, a DataFrame of fruits.

index      name
1          apple
2          banana
3          strawberry

Their market prices are stored in a MySQL database, as shown below:

category      market      price
apple         A           1.0
apple         B           1.5
banana        A           1.2
banana        A           3.0
apple         C           1.8
strawberry    B           2.7        
...

While iterating over df_fruits, I want to do some processing for each row.

The code below is the non-parallel version.

def process(fruit):
    # make DB connection
    # fetch the prices of fruit from database
    # do some processing with fetched data, which takes a long time
    # insert the result into DB
    # close DB connection
    pass  # placeholder so the stub is valid Python

for idx, f in df_fruits.iterrows():
    process(f)

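What I want to do is run process for each row of df_fruits in parallel, because df_fruits has many rows and the market-price table is quite large (fetching the data takes a long time).

As you can see, the order of execution between rows doesn't matter, and no data is shared between them.

I'm confused about where to put pool.map() when iterating over df_fruits. Do I need to split the rows and assign the chunks to each process before running in parallel? (If so, wouldn't a process that finishes earlier than the others sit idle?)

I have looked into pandarallel, but I can't use it (my OS is Windows).

Any help would be appreciated.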

Answer 1

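There's no need to use pandas for this at all. You can simply use Pool from the multiprocessing package. Pool.map() takes two inputs: a function and an iterable (for example, a list) of values.

So you can do: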

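from multiprocessing import Pool

# process must be defined at module top level so the workers can import it
if __name__ == '__main__':  # guard needed on Windows, where workers re-import this module
    n = 5  # number of worker processes
    with Pool(n) as p:
        p.map(process, df_fruits['name'].values)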

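This processes all the rows of the df_fruits DataFrame in parallel. Note that no results are collected here, because the process function is meant to write its results back to the database.

If you want to take multiple columns of each row into account, you can change df_fruits['name'].values to: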

df_fruits[cols].to_dict('records')

This passes a dictionary to process as input, for example:

{'name': 'apple', 'index': 1, ...}
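
To make the multi-column variant concrete, here is a minimal sketch. It assumes a stand-in function, process_record (a hypothetical name, not from the question), in place of the real database work, and it rebuilds a small df_fruits by hand so the snippet runs on its own:

```python
from multiprocessing import Pool

import pandas as pd


def process_record(record):
    # Stand-in for the real work (hypothetical): the actual function would
    # open a DB connection, fetch the prices for record['name'], process
    # them, and write the result back.
    return f"{record['index']}:{record['name']}"


def main():
    df_fruits = pd.DataFrame(
        {"index": [1, 2, 3], "name": ["apple", "banana", "strawberry"]}
    )
    # One plain dict per row, restricted to the columns of interest.
    records = df_fruits[["index", "name"]].to_dict("records")
    with Pool(2) as pool:
        # map blocks until every record has been processed.
        results = pool.map(process_record, records)
    print(results)


if __name__ == "__main__":
    # Required on Windows, where worker processes re-import this module.
    main()
```

The worker function has to live at module top level (not inside main or a notebook cell's local scope) so that the worker processes can find it by name.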

Answer 2

Yes, it is possible, although pandas does not provide it directly.

Maybe you can try something like this:

import numpy as np
import pandas as pd
from multiprocessing import Pool

def do_parallel_stuff_on_dataframe(df, fn_to_execute, num_cores):
    # create a pool for multiprocessing
    pool = Pool(num_cores)

    # split your dataframe to execute on these pools
    splitted_df = np.array_split(df, num_cores)

    # execute in parallel:
    split_df_results = pool.map(fn_to_execute, splitted_df)

    #combine your results
    df = pd.concat(split_df_results)

    pool.close()
    pool.join()
    return df
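
Note that with this approach the function you pass in receives a whole chunk (a DataFrame), not a single row, and should return a DataFrame. A rough sketch of what that might look like, where add_name_length and run_in_chunks are hypothetical names made up for illustration and the chunks are built from row positions rather than by splitting the DataFrame directly:

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd


def add_name_length(chunk):
    # Hypothetical per-chunk function: takes a DataFrame slice and returns
    # the same rows with one extra column.
    chunk = chunk.copy()
    chunk["name_length"] = chunk["name"].str.len()
    return chunk


def run_in_chunks(df, fn, num_cores):
    # Split the row positions into num_cores contiguous groups, then take
    # the matching slice of the DataFrame for each group.
    positions = np.array_split(np.arange(len(df)), num_cores)
    pieces = [df.iloc[idx] for idx in positions]
    with Pool(num_cores) as pool:
        # One chunk per worker; map preserves the chunk order.
        return pd.concat(pool.map(fn, pieces))


if __name__ == "__main__":
    # Guard required on Windows, where workers re-import this module.
    df_fruits = pd.DataFrame({"name": ["apple", "banana", "strawberry"]})
    print(run_in_chunks(df_fruits, add_name_length, 2))
```

The trade-off is the one raised in the question: because each worker gets exactly one chunk, a worker that finishes early has nothing else to pick up. That is fine when rows cost about the same, less so when some rows are much slower than others.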

Answer 3

You might be able to do something like this:

with Pool() as pool:
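    # create an iterator that just gives you the fruit and not the index
    rows = (f for _, f in df_fruits.iterrows())
    # imap returns immediately; iterate over it so all rows finish
    # before the with block terminates the pool
    for _ in pool.imap(process, rows):
        pass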

If you don't care about the results, or are happy to get them in any order, you may want to use one of the pool primitives other than map.
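
For instance, imap_unordered yields each result as soon as its task finishes. A small sketch, assuming a stand-in process function (the real one would do the slow database work) and a hard-coded list of names:

```python
from multiprocessing import Pool


def process(name):
    # Stand-in for the real, slow DB-backed function (hypothetical body).
    return name.upper()


def main():
    names = ["apple", "banana", "strawberry"]
    with Pool(2) as pool:
        # chunksize=1 hands out one item at a time, so a worker that
        # finishes early picks up the next pending item right away.
        for result in pool.imap_unordered(process, names, chunksize=1):
            print(result)  # arrival order is not guaranteed


if __name__ == "__main__":
    # Guard required on Windows, where workers re-import this module.
    main()
```

Because items are handed out one at a time, there is no need to split the rows into per-process chunks yourself, and no worker sits idle while others still have a backlog.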

Parallel processing of each row when iterating over a pandas DataFrame
