Merging two dataframes in parallel

Time: 2020-01-04 21:20:33

Tags: python-3.x merge multiprocessing

I am trying to merge two dataframes using Python 3 multiprocessing with the following code:

def parallelize_merging(df1, df2, func):


    pool = Pool(num_cores)
    df = pool.map(func, [[df1, df2]])
    pool.close()
    pool.join()
    return df



def merge_two_dataframes(data1, data2):

    return pd.merge(data1, data2, on='destination_ip', how='left').fillna('', inplace=True)

In my main code, I call these two functions as follows:

new_df = parallelize_merging(df1, df2, merge_two_dataframes)

However, I get the following error:

TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<timed exec> in process_file(conn_log_file)

<timed exec> in parallelize_merging(df1, df2, func)

~/anaconda3/lib/python3.7/multiprocessing/pool.py in starmap(self, func, iterable, chunksize)
275         '''
--> 276         return self._map_async(func, iterable, starmapstar, chunksize).get()
277 

~/anaconda3/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
656         else:
--> 657             raise self._value
658 

TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed

I am not sure why it says a 'list' object was passed, since both variables 'df1' and 'df2' are definitely 'DataFrame' objects.
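A note on how the arguments reach the worker function may help here. `Pool.map` passes each element of the iterable to the function as a single argument, while `Pool.starmap` unpacks each element into separate positional arguments. The traceback above goes through `starmap`, whereas the posted code calls `pool.map`, so the code that actually ran may differ from what is shown. One plausible reading is that a nested list ended up as one of the arguments to `pd.merge`. The minimal sketch below only illustrates the difference between the two calls. The names `describe_args` and `run_both` are hypothetical, and `ThreadPool` is used only because it exposes the same `map`/`starmap` interface as `multiprocessing.Pool` and keeps the example easy to run.

```python
from multiprocessing.pool import ThreadPool  # same map/starmap API as multiprocessing.Pool


def describe_args(*args):
    """Return the type name of every positional argument that arrived."""
    return [type(a).__name__ for a in args]


def run_both(tasks):
    """Run the same task list through map and starmap and return both results."""
    with ThreadPool(2) as pool:
        via_map = pool.map(describe_args, tasks)          # one argument per task
        via_starmap = pool.starmap(describe_args, tasks)  # each task is unpacked
    return via_map, via_starmap


# One task holding two objects, shaped like [[df1, df2]] in the question.
via_map, via_starmap = run_both([[{"a": 1}, {"b": 2}]])
print(via_map)      # [['list']]: the whole inner list arrived as one argument
print(via_starmap)  # [['dict', 'dict']]: the inner list was unpacked into two
```

If the worker reports a `list` where a `DataFrame` was expected, checking the exact nesting of the iterable passed to `map` or `starmap` is a reasonable first step.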

Can someone help fix the above code so that it works? Thanks.
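One possible direction, offered as a sketch rather than a confirmed fix: passing a single `[df1, df2]` task to the pool gives the workers nothing to split, so the merge still runs in one process. The version below assumes it is acceptable to split the left frame row-wise, left-merge each chunk against the full right frame, and concatenate the parts. It also returns the result of `fillna('')` directly, because `fillna(..., inplace=True)` returns `None`. The names `merge_chunk`, `parallel_left_merge`, `n_chunks`, and `pool_factory` are hypothetical and not part of the original code.

```python
import multiprocessing as mp

import pandas as pd


def merge_chunk(left_chunk, right):
    """Left-merge one chunk of the left frame with the full right frame."""
    merged = pd.merge(left_chunk, right, on="destination_ip", how="left")
    # fillna(..., inplace=True) returns None, so return the filled copy instead.
    return merged.fillna("")


def parallel_left_merge(left, right, n_chunks=2, pool_factory=mp.Pool):
    """Split `left` row-wise, merge each chunk in a worker, then concatenate."""
    size = max(1, -(-len(left) // n_chunks))  # ceiling division, at least 1 row
    chunks = [left.iloc[i:i + size] for i in range(0, len(left), size)]
    if not chunks:
        # Empty left frame: nothing to parallelize.
        return merge_chunk(left, right)
    with pool_factory(n_chunks) as pool:
        # starmap unpacks each (chunk, right) tuple into two arguments.
        parts = pool.starmap(merge_chunk, [(chunk, right) for chunk in chunks])
    return pd.concat(parts, ignore_index=True)
```

With the default `mp.Pool`, the call should sit under an `if __name__ == "__main__":` guard on platforms that start workers with `spawn`, for example `new_df = parallel_left_merge(df1, df2, n_chunks=4)`. Each worker receives a pickled copy of the right frame, so for small inputs the overhead can outweigh any gain, and timing it against a plain `pd.merge` is worthwhile.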

0 answers:

No answers