I am trying to merge two pandas DataFrames that have no index set:
In [127]: df1
Out[127]:
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group d
1 -0.4875  2012-04-01  b -0.021274  group d
2  0.1139  2012-04-01  c -0.015978  group d
3  0.3191  2012-04-01  d  0.022634  group d
4 -0.0077  2012-04-01  e  0.000000  group d
In [128]: df2
Out[128]:
             date id       value2    group
23044  2012-04-01  a  -0.06701001  group c
23045  2012-04-01  b     -0.02128  group c
23046  2012-04-01  c            0  group c
23047  2012-04-01  d            0  group c
23048  2012-04-01  e            0  group c
In [129]: pd.merge(df1, df2, how = 'outer', on = ['date', 'id', 'value2', 'group'])
Out[129]:
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group d
1 -0.4875  2012-04-01  b -0.021274  group d
2  0.1139  2012-04-01  c -0.015978  group d
3  0.3191  2012-04-01  d  0.022634  group d
4 -0.0077  2012-04-01  e  0.000000  group d
5     NaN  2012-04-01  a -0.067010  group c
6     NaN  2012-04-01  b -0.021280  group c
7     NaN  2012-04-01  c  0.000000  group c
8     NaN  2012-04-01  d  0.000000  group c
9     NaN  2012-04-01  e  0.000000  group c
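This is almost the desired output, except that I want the NaN values of value1 in group c to be filled with the value1 from group d, matched on date and id. What is the correct way to achieve this?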
Answer 0 (score: 0)
I think this unavoidably takes two steps.
To "fill in" value1, match rows on (date, id) alone, regardless of their group or value2.
In [5]: df3 = df2.set_index(['date', 'id']).join(
....: df1.set_index(['date', 'id'])['value1']).reset_index()
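For anyone who wants to try this first step in isolation, here is a minimal self-contained sketch. The frames are rebuilt by hand from the question's printout; writing the group labels with underscores and keeping the dates as plain strings are assumptions made for illustration, as is the intermediate name `lookup`.

```python
import pandas as pd

# Data rebuilt from the question's printout. Assumptions: group labels use
# underscores and dates are kept as plain strings.
df1 = pd.DataFrame({
    "value1": [-0.2284, -0.4875, 0.1139, 0.3191, -0.0077],
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.067469, -0.021274, -0.015978, 0.022634, 0.0],
    "group": ["group_d"] * 5,
})
df2 = pd.DataFrame(
    {
        "date": ["2012-04-01"] * 5,
        "id": list("abcde"),
        "value2": [-0.06701001, -0.02128, 0.0, 0.0, 0.0],
        "group": ["group_c"] * 5,
    },
    index=range(23044, 23049),
)

# Look up value1 by (date, id) only, ignoring group and value2.
# `lookup` is a hypothetical helper name, not part of the original answer.
lookup = df1.set_index(["date", "id"])["value1"]

# Left-join the lookup onto df2, then turn (date, id) back into columns.
df3 = df2.set_index(["date", "id"]).join(lookup).reset_index()
print(df3)
```

Because the join key is only (date, id), every group c row picks up the value1 of the group d row with the same date and id. If a (date, id) pair appeared more than once in df1, the join would duplicate rows, so this sketch assumes the pairs are unique.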
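To get the final result, merge on all of the columns so that rows are distinguished by every attribute, and group and value2 are no longer lumped together.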
In [6]: pd.merge(df1, df3, how = 'outer',
....: on = ['date', 'id', 'value1', 'value2', 'group'])
Out[6]:
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group_d
1 -0.4875  2012-04-01  b -0.021274  group_d
2  0.1139  2012-04-01  c -0.015978  group_d
3  0.3191  2012-04-01  d  0.022634  group_d
4 -0.0077  2012-04-01  e  0.000000  group_d
5 -0.2284  2012-04-01  a -0.067010  group_c
6 -0.4875  2012-04-01  b -0.021280  group_c
7  0.1139  2012-04-01  c  0.000000  group_c
8  0.3191  2012-04-01  d  0.000000  group_c
9 -0.0077  2012-04-01  e  0.000000  group_c
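For comparison, the same table can probably be reached with a different technique from the join and merge above: stack the two frames, then fill the missing value1 within each (date, id) pair. This is only a sketch under the same assumptions about the data (underscored group labels, string dates, one non-null value1 per pair); the names `combined` and `first_value1` are made up for illustration.

```python
import pandas as pd

# Data rebuilt from the question's printout. Assumptions: group labels use
# underscores and dates are kept as plain strings.
df1 = pd.DataFrame({
    "value1": [-0.2284, -0.4875, 0.1139, 0.3191, -0.0077],
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.067469, -0.021274, -0.015978, 0.022634, 0.0],
    "group": ["group_d"] * 5,
})
df2 = pd.DataFrame(
    {
        "date": ["2012-04-01"] * 5,
        "id": list("abcde"),
        "value2": [-0.06701001, -0.02128, 0.0, 0.0, 0.0],
        "group": ["group_c"] * 5,
    },
    index=range(23044, 23049),
)

# Stack the frames; the group_c rows have no value1 yet, so they get NaN.
combined = pd.concat([df1, df2], ignore_index=True)

# For every row, find the first non-null value1 within its (date, id) pair.
first_value1 = combined.groupby(["date", "id"])["value1"].transform("first")

# Fill only the gaps, leaving the existing group_d values untouched.
combined["value1"] = combined["value1"].fillna(first_value1)
print(combined)
```

This skips the intermediate frame, at the cost of relying on each (date, id) pair having a single sensible value1 to copy. The two-step join and merge makes that matching explicit, which may be easier to reason about when the data is less tidy.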