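I want to merge dozens of dataframes with a "reference" dataframe. Where a column exists in both dataframes, I want the values combined into a single column; where it does not exist yet, I want a new column created. I think this is closely related to this topic, but I can't work out whether it applies to my case. Also, note that the key used for merging never contains duplicates.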
import pandas as pd

# Reference dataframe
df = pd.DataFrame({'date_time':['2018-06-01 00:00:00','2018-06-01 00:30:00','2018-06-01 01:00:00','2018-06-01 01:30:00']})
# Dataframes to merge to reference dataframe
df1 = pd.DataFrame({'date_time':['2018-06-01 00:30:00','2018-06-01 01:00:00'],
'potato':[13,21]})
df2 = pd.DataFrame({'date_time':['2018-06-01 01:30:00','2018-06-01 02:00:00','2018-06-01 02:30:00'],
'carrot':[14,8,32]})
df3 = pd.DataFrame({'date_time':['2018-06-01 01:30:00','2018-06-01 02:00:00'],
'potato':[27,31]})
df = df.merge(df1, how='left', on='date_time')
df = df.merge(df2, how='left', on='date_time')
df = df.merge(df3, how='left', on='date_time')
The result is:
date_time potato_x carrot potato_y
0 2018-06-01 00:00:00 NaN NaN NaN
1 2018-06-01 00:30:00 13.0 NaN NaN
2 2018-06-01 01:00:00 21.0 NaN NaN
3 2018-06-01 01:30:00 NaN 14.0 27.0
What I want is:
date_time potato carrot
0 2018-06-01 00:00:00 NaN NaN
1 2018-06-01 00:30:00 13.0 NaN
2 2018-06-01 01:00:00 21.0 NaN
3 2018-06-01 01:30:00 27.0 14.0
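Edit (following @sammywemmy's answer): I don't know the column names of the dataframes before importing them (in a loop). Typically, the dataframes merged with my reference dataframe contain about 100 columns, 90%-95% of which also appear in the other dataframes.
Answer 0 (score: 0)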
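Continuing from your code:
use the filter method to pull out the potato-related columns,
sum along the column axis,
and drop the columns whose names contain potato_.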
df['potato'] = df.filter(like='potato').fillna(0).sum(axis=1)
exclude_columns = df.columns.str.contains('potato_[a-z]')
df = df.loc[:,~exclude_columns]
date_time carrot potato
0 2018-06-01 00:00:00 NaN 0.0
1 2018-06-01 00:30:00 NaN 13.0
2 2018-06-01 01:00:00 NaN 21.0
3 2018-06-01 01:30:00 14.0 27.0
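The snippet above hardcodes the name potato and, because of fillna(0), reports 0.0 in row 0 where no dataframe supplied a value. When the column names are not known in advance, the same idea can be sketched as a loop. This is only a minimal sketch, not a definitive implementation: merge_and_coalesce and the '_new' suffix are hypothetical names introduced here, and it assumes no real column name ends in '_new'.

```python
import pandas as pd

def merge_and_coalesce(reference, frames, key='date_time', tmp_suffix='_new'):
    """Left-merge each frame onto reference, folding repeated columns together.

    tmp_suffix is a hypothetical marker; it is assumed that no real column
    name ends with it.
    """
    out = reference.copy()
    for frame in frames:
        # Columns already present keep their name; incoming duplicates get tmp_suffix.
        out = out.merge(frame, how='left', on=key, suffixes=('', tmp_suffix))
        for col in [c for c in out.columns if c.endswith(tmp_suffix)]:
            base = col[:-len(tmp_suffix)]
            # Fill gaps in the existing column with the incoming values, then drop the copy.
            out[base] = out[base].fillna(out[col])
            out = out.drop(columns=col)
    return out

df = pd.DataFrame({'date_time': ['2018-06-01 00:00:00', '2018-06-01 00:30:00',
                                 '2018-06-01 01:00:00', '2018-06-01 01:30:00']})
df1 = pd.DataFrame({'date_time': ['2018-06-01 00:30:00', '2018-06-01 01:00:00'],
                    'potato': [13, 21]})
df2 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00',
                                  '2018-06-01 02:30:00'],
                    'carrot': [14, 8, 32]})
df3 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00'],
                    'potato': [27, 31]})

result = merge_and_coalesce(df, [df1, df2, df3])
print(result)
```

Because the gaps are filled with fillna rather than summed, a row that no dataframe covers stays NaN. If two dataframes carry different non-null values for the same key and column, the one merged first wins.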
Answer 1 (score: 0)
I would pd.concat the similarly structured dataframes and then use merge, like this:
df.merge(pd.concat([df1, df3]), on='date_time', how='left')\
.merge(df2, on='date_time', how='left')
Output:
date_time potato carrot
0 2018-06-01 00:00:00 NaN NaN
1 2018-06-01 00:30:00 13.0 NaN
2 2018-06-01 01:00:00 21.0 NaN
3 2018-06-01 01:30:00 27.0 14.0
Per the comments below:
df = pd.DataFrame({'date_time':['2018-06-01 00:00:00','2018-06-01 00:30:00','2018-06-01 01:00:00','2018-06-01 01:30:00']})
# Dataframes to merge to reference dataframe
df1 = pd.DataFrame({'date_time':['2018-06-01 00:30:00','2018-06-01 01:00:00'],
'potato':[13,21]})
df2 = pd.DataFrame({'date_time':['2018-06-01 01:30:00','2018-06-01 02:00:00','2018-06-01 02:30:00'],
'carrot':[14,8,32]})
df3 = pd.DataFrame({'date_time':['2018-06-01 01:30:00', '2018-06-01 02:00:00'],'potato':[27,31], 'zucchini':[11,1]})
df.merge(pd.concat([df1, df3]), on='date_time', how='left').merge(df2, on='date_time', how='left')
Output:
date_time potato zucchini carrot
0 2018-06-01 00:00:00 NaN NaN NaN
1 2018-06-01 00:30:00 13.0 NaN NaN
2 2018-06-01 01:00:00 21.0 NaN NaN
3 2018-06-01 01:30:00 27.0 11.0 14.0
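With dozens of dataframes it may be impractical to decide by hand which ones go into pd.concat and which are merged separately. Concatenating everything directly would leave two rows for a key such as 2018-06-01 01:30:00 (one from df2, one from df3), so a possible variation is to collapse the stacked rows per key before merging. The following is a sketch under that assumption; concat_then_merge is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

def concat_then_merge(reference, frames, key='date_time'):
    """Stack all frames, keep the first non-null value per key and column,
    then left-merge the collapsed table onto the reference dataframe."""
    stacked = pd.concat(frames, ignore_index=True)
    # GroupBy.first() skips NaN, so each key ends up as a single row.
    collapsed = stacked.groupby(key, as_index=False).first()
    return reference.merge(collapsed, how='left', on=key)

df = pd.DataFrame({'date_time': ['2018-06-01 00:00:00', '2018-06-01 00:30:00',
                                 '2018-06-01 01:00:00', '2018-06-01 01:30:00']})
df1 = pd.DataFrame({'date_time': ['2018-06-01 00:30:00', '2018-06-01 01:00:00'],
                    'potato': [13, 21]})
df2 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00',
                                  '2018-06-01 02:30:00'],
                    'carrot': [14, 8, 32]})
df3 = pd.DataFrame({'date_time': ['2018-06-01 01:30:00', '2018-06-01 02:00:00'],
                    'potato': [27, 31], 'zucchini': [11, 1]})

result = concat_then_merge(df, [df1, df2, df3])
print(result)
```

This needs no knowledge of the column names and performs a single merge. When two dataframes hold different non-null values for the same key and column, the value from the dataframe listed first is kept.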