How do I use pool.starmap() on a pandas DataFrame?

Asked: 2017-11-09 14:33:25

Tags: python pandas multiprocessing

Following the second answer on this post, I tried the following code

from multiprocessing import Pool
import numpy as np
from itertools import repeat
import pandas as pd

def doubler(number, r):
    result = number * 2 + r
    return result

def f1():
    return np.random.randint(20)

if __name__ == '__main__':
    df = pd.DataFrame({"A": [10,20,30,40,50,60], "B": [-1,-2,-3,-4,-5,-6]})
    num_chunks = 3
    # break df into 3 chunks
    chunks_dict = {i:np.array_split(df, num_chunks)[i] for i in range(num_chunks)}

    arg1 = f1()

    with Pool() as pool:
        results = pool.starmap(doubler, [zip(chunks_dict[i]['B'], repeat(arg1)) for i in range(num_chunks)])

    print(results)

>>> [(-1, 20, -1, 20, -2, 20), (-3, 20, -3, 20, -4, 20), (-5, 20, -5, 20, -6, 20)]

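This is not the result I want. What I want is to feed each element of column B of df into the doubler function, together with the output of f1 (which is why I am using starmap and repeat), and get back a list in which each input is doubled and has the random integer added.

For example, if the output of f1 were 2, I would want to get back

>>> [0,-2,-4,-6,-8,-10] # [2*(-1) + 2, 2*(-2) + 2, ... ]

Can anyone tell me how to achieve this? Thanks

Edit: Passing in the whole DataFrame does not work either.

Basically, I just want to break my DataFrame into chunks and pass those chunks, along with other variables (arg1), into a function that accepts multiple arguments.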

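1 Answer:

Answer 0 (score: 2)

Your arguments don't look right. For example, after adding a print of the arguments inside doubler, I see the following (assuming f1() returned 2):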

doubler number (-3, 2) r (-4, 2)
doubler number (-1, 2) r (-2, 2)
doubler number (-5, 2) r (-6, 2)

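This is because each element of the list passed to starmap is a zip object rather than a (number, r) tuple. starmap unpacks every element into positional arguments, so for a two-row chunk the two tuples produced by the zip end up as number and r.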
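The unpacking can be reproduced without a process pool. The sketch below uses itertools.starmap as a single-process stand-in, on the assumption that it unpacks each element the same way pool.starmap does. The names chunk_b, per_chunk and per_value are illustrative, and r is fixed at 2 instead of calling f1().

```python
from itertools import repeat, starmap


def doubler(number, r):
    return number * 2 + r


chunk_b = [-1, -2]  # B values of the first chunk (illustrative)
r = 2               # assumed f1() output

# What the question passes: one zip object per chunk.
# starmap unpacks the zip into its two items, so doubler is called
# with number=(-1, 2) and r=(-2, 2), and tuple arithmetic takes over.
per_chunk = list(starmap(doubler, [zip(chunk_b, repeat(r))]))

# What doubler needs: one (number, r) tuple per call.
per_value = list(starmap(doubler, zip(chunk_b, repeat(r))))

print(per_chunk)  # [(-1, 2, -1, 2, -2, 2)]
print(per_value)  # [0, -2]
```

The first result has the same shape as the unwanted output in the question; the second is what the question is after.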

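I think it is much easier to rewrite the chunking and the argument generation. Assuming I have understood this correctly, you want to generate the following list of tuples as arguments (assuming f1() returns 2):

[(-1, 2), (-2, 2), (-3, 2), (-4, 2), (-5, 2), (-6, 2)]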

This is then applied to the doubler function, so that starmap returns [doubler(-1, 2), doubler(-2, 2), ... doubler(-6, 2)], that is, [0, -2, -4, -6, -8, -10]. Try this:

from multiprocessing import Pool
import numpy as np
from itertools import repeat
import pandas as pd

def doubler(number, r):
    result = number * 2 + r
    return result

def f1():
    return np.random.randint(20)

if __name__ == '__main__':
    df = pd.DataFrame({"A": [10, 20, 30, 40, 50, 60], "B": [-1, -2, -3, -4, -5, -6]})
    num_processes = 3

    # the "r" value to use with every "B" value
    random_r = f1()

    # zip together a list of tuples of each B value and the random r value
    tuples = [(b, r) for b, r in zip(df.B.values, repeat(random_r, len(df.B.values)))]

    print(tuples)
    with Pool(num_processes) as pool:
        results = pool.starmap(doubler, tuples)

    print(results)
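If the chunking from the question should be kept, so that each worker receives a whole chunk plus the extra argument rather than a single value, one possible sketch is shown below. The helpers doubler_chunk, build_args and run_chunked are hypothetical names, not part of the original code. r is fixed at 2 for illustration, and column B is split as a NumPy array rather than splitting the DataFrame itself.

```python
from multiprocessing import Pool

import numpy as np
import pandas as pd


def doubler_chunk(chunk, r):
    # chunk is a NumPy array, so the arithmetic is element-wise
    return (chunk * 2 + r).tolist()


def build_args(df, r, num_chunks):
    # one (chunk, r) tuple per task; starmap unpacks each tuple
    chunks = np.array_split(df["B"].to_numpy(), num_chunks)
    return [(chunk, r) for chunk in chunks]


def run_chunked(df, r, num_chunks=3):
    with Pool(num_chunks) as pool:
        per_chunk = pool.starmap(doubler_chunk, build_args(df, r, num_chunks))
    # flatten the per-chunk lists back into one list
    return [value for part in per_chunk for value in part]


if __name__ == "__main__":
    df = pd.DataFrame({"A": [10, 20, 30, 40, 50, 60],
                       "B": [-1, -2, -3, -4, -5, -6]})
    print(run_chunked(df, r=2))
```

For a function this cheap, the cost of sending data to worker processes will likely outweigh any gain; chunking tends to pay off only when the work per chunk is substantial.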