I am trying to merge two DataFrames using Python 3 multiprocessing with the following code:
def parallelize_merging(df1, df2, func):
    pool = Pool(num_cores)
    df = pool.map(func, [[df1, df2]])
    pool.close()
    pool.join()
    return df

def merge_two_dataframes(data1, data2):
    return pd.merge(data1, data2, on='destination_ip', how='left').fillna('', inplace=True)
In my main function, I call these two functions as follows:
new_df = parallelize_merging(df1, df2, merge_two_dataframes)
However, I get the following error:
TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed
"""
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<timed exec> in process_file(conn_log_file)
<timed exec> in parallelize_merging(df1, df2, func)
~/anaconda3/lib/python3.7/multiprocessing/pool.py in starmap(self, func, iterable, chunksize)
275 '''
--> 276 return self._map_async(func, iterable, starmapstar, chunksize).get()
277
~/anaconda3/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
656 else:
--> 657 raise self._value
658
TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed
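I'm not sure why it says that a 'list' object was passed, since both variables 'df1' and 'df2' are definitely 'DataFrame' objects.
Can someone help me fix the code above so that it works? Thanks.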
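
A possible explanation, offered as a sketch rather than a confirmed diagnosis: `Pool.map` calls the function once per item of the iterable and passes that item as a single argument, so with `[[df1, df2]]` the worker receives one list instead of two DataFrames. `Pool.starmap` unpacks each item into positional arguments. Separately, `fillna('', inplace=True)` returns `None` in the pandas versions I know of, so the merged frame would be lost even if the call succeeded. The example below assumes small made-up sample frames and a hypothetical helper name `merge_with_starmap`. It uses `ThreadPool` in place of `Pool` only so the snippet runs as a plain script on any platform; `ThreadPool` exposes the same `map` and `starmap` methods.

```python
from multiprocessing.pool import ThreadPool  # same map/starmap API as multiprocessing.Pool

import pandas as pd


def merge_two_dataframes(data1, data2):
    # Return the result of fillna directly; with inplace=True it would return None.
    return pd.merge(data1, data2, on="destination_ip", how="left").fillna("")


def merge_with_starmap(df1, df2, func, num_workers=2):
    # Hypothetical helper: starmap unpacks each tuple, so this calls func(df1, df2).
    # pool.map(func, [[df1, df2]]) would call func([df1, df2]) with one list argument.
    with ThreadPool(num_workers) as pool:
        results = pool.starmap(func, [(df1, df2)])
    return results[0]  # one task was submitted, so there is exactly one result


# Made-up sample data, for illustration only.
df1 = pd.DataFrame({"destination_ip": ["10.0.0.1", "10.0.0.2"], "bytes": [120, 45]})
df2 = pd.DataFrame({"destination_ip": ["10.0.0.1"], "host": ["alpha"]})

new_df = merge_with_starmap(df1, df2, merge_two_dataframes)
print(new_df)
```

Note that submitting a single task does not split the merge across cores; to get any parallel speedup, one of the frames would need to be divided into chunks, each chunk merged as its own task, and the results concatenated.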