How do I parallelize the `map` function on a pandas Series object?

Asked: 2016-04-28 09:42:59

Tags: python pandas parallel-processing

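For example, I have a Series object whose values are the file names of waveform dumps. Say I want to compute the mean of each waveform.

Why do I want to parallelize this? The waveform dumps can only be read by proprietary software, so I have to call that program to do the analysis (it can write to stdout, so that part is no problem).

In code, it looks like this: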

from subprocess import check_output

def get_average(filename_str):
    average = check_output(['proprietary_mean_calculator', filename_str])
    return float(average)

# waveform_dumps is a pandas Series object
waveform_averages = waveform_dumps.map(get_average)

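1 Answer:

Answer 0: (score: 2)

Whether or not you are using pandas probably doesn't matter. What you are looking for is simple parallel execution.

Try concurrent.futures: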

from subprocess import check_output
import concurrent.futures
import pandas as pd

def get_average(filename_str):
    average = check_output(['proprietary_mean_calculator', filename_str])
    return float(average)

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map returns a lazy iterator; list() collects the results in input order
    waveform_averages = list(executor.map(get_average, waveform_dumps))

# to make the result a pd.Series, if desired:
waveform_averages = pd.Series(waveform_averages, index=waveform_dumps.index)
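
If you want to try the pattern without the proprietary tool, here is a minimal self-contained sketch. Threads are enough here because each worker spends its time waiting on a child process rather than running Python code. The assumptions are mine, not part of the original setup: a child Python process stands in for 'proprietary_mean_calculator' and simply prints the length of the file name, `parallel_map` is a hypothetical helper name, and the file names and index labels are made up.

```python
import sys
from concurrent.futures import ThreadPoolExecutor
from subprocess import check_output

import pandas as pd

# Assumption: stand-in for 'proprietary_mean_calculator'. The child process
# just prints the length of its argument so the example runs anywhere.
STAND_IN_CMD = [sys.executable, "-c", "import sys; print(len(sys.argv[1]))"]


def get_average(filename_str):
    output = check_output(STAND_IN_CMD + [filename_str])
    return float(output.decode().strip())


def parallel_map(func, series, max_workers=4):
    """Hypothetical helper: apply func to each value using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs,
        # so they line up with the original index.
        results = list(executor.map(func, series))
    return pd.Series(results, index=series.index)


# Made-up file names and index labels, for illustration only.
waveform_dumps = pd.Series(["a.dump", "bb.dump", "ccc.dump"], index=["x", "y", "z"])
waveform_averages = parallel_map(get_average, waveform_dumps)
```

Because the results come back in input order, reattaching `series.index` keeps every value aligned with its file name. A reasonable starting point for `max_workers` is the number of copies of the external program your machine can run at once.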