Question

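My input DataFrame is analyzed by three different user-defined operations: op1, op2, and op3. Running them sequentially is inefficient because they are independent of one another. All three can run in parallel, since they only read the input DataFrame and never write to it.

I tried to improve performance with Python's multiprocessing module. I turned the three operations into functions and dispatched them with apply_async: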

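import multiprocessing

def calculate(func, df):
    # Run one operation on the DataFrame and hand its result back
    return func(df)

def analyze(operations, df):
    tasks = [(op, df) for op in operations]
    with multiprocessing.Pool() as pool:
        # Submit every task first; each submission pickles df for a worker
        results = [pool.apply_async(calculate, t) for t in tasks]
        # Block until all operations have finished and collect their results
        return [r.get() for r in results]

if __name__ == "__main__":  # guard needed when workers are spawned
    operations = [op1, op2, op3]
    analyze(operations, df)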

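In theory, I expected a 3x speedup, since all three operations can now run in parallel. In practice, I only see about a 1.5x improvement. Can someone help me understand why? Is my implementation wrong, or is there another way to reach 3x?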
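
One thing that may be worth checking is whether the three operations take comparable time. With one worker per operation, the parallel run cannot finish before the slowest operation does, so the best-case speedup is the sum of the sequential times divided by the largest one. On top of that, apply_async pickles the DataFrame once per task to send it to a worker. The sketch below is only a diagnostic aid under those assumptions; the helper names `time_operations`, `ideal_speedup`, and `pickle_seconds` are hypothetical and not part of multiprocessing or pandas.

```python
import pickle
import time


def time_operations(operations, data):
    """Run each operation sequentially; return its wall-clock time in seconds."""
    durations = []
    for op in operations:
        start = time.perf_counter()
        op(data)
        durations.append(time.perf_counter() - start)
    return durations


def ideal_speedup(durations):
    """Best-case speedup with one worker per operation: total / slowest."""
    return sum(durations) / max(durations)


def pickle_seconds(data):
    """Time one pickle round trip, a rough proxy for shipping data to a worker."""
    start = time.perf_counter()
    pickle.loads(pickle.dumps(data))
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical timings where one operation dominates the other two
    print(ideal_speedup([4.0, 1.0, 1.0]))  # 1.5
```

In the setting of the question, this would be called as `time_operations([op1, op2, op3], df)` and `pickle_seconds(df)`. If one operation accounts for most of the total time, or if pickling the DataFrame is slow relative to the operations themselves, a speedup well below 3x would be expected even with a correct implementation.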

Python - Multiprocessing a Pandas DataFrame with different functions

0 answers: