Confusion about pandas merge

Asked: 2013-04-15 22:14:23

Tags: merge pandas

I am trying to merge two pandas DataFrames that have no index set:

In [127]: df1
Out[127]: 
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group d
1 -0.4875  2012-04-01  b -0.021274  group d
2  0.1139  2012-04-01  c -0.015978  group d
3  0.3191  2012-04-01  d  0.022634  group d
4 -0.0077  2012-04-01  e  0.000000  group d

In [128]: df2
Out[128]: 
             date id      value2    group
23044  2012-04-01  a -0.06701001  group c
23045  2012-04-01  b    -0.02128  group c
23046  2012-04-01  c           0  group c
23047  2012-04-01  d           0  group c
23048  2012-04-01  e           0  group c

In [129]: pd.merge(df1, df2, how = 'outer', on = ['date', 'id', 'value2', 'group'])
Out[129]: 
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group d
1 -0.4875  2012-04-01  b -0.021274  group d
2  0.1139  2012-04-01  c -0.015978  group d
3  0.3191  2012-04-01  d  0.022634  group d
4 -0.0077  2012-04-01  e  0.000000  group d
5     NaN  2012-04-01  a -0.067010  group c
6     NaN  2012-04-01  b -0.021280  group c
7     NaN  2012-04-01  c  0.000000  group c
8     NaN  2012-04-01  d  0.000000  group c
9     NaN  2012-04-01  e  0.000000  group c

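This is almost the desired output, except that I want the NaN values of value1 in group c to be filled with value1 from group d, matched on date and id. What is the correct way to achieve this?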

1 answer:

Answer 0 (score: 0)

I think this inevitably breaks down into two steps.

To "fill in" value1, you associate every row of df2 with the df1 row that has the same (date, id), regardless of group or value2.

In [5]: df3 = df2.set_index(['date', 'id']).join(
  ....:     df1.set_index(['date', 'id'])['value1']).reset_index()
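For anyone who wants to try this first step without the original data, here is a minimal, self-contained sketch. The two frames are rebuilt by hand from the values shown in the question, and the name `lookup` is only an illustrative helper, not part of the original answer.

```python
import pandas as pd

# Frames rebuilt by hand from the values printed in the question.
df1 = pd.DataFrame({
    "value1": [-0.2284, -0.4875, 0.1139, 0.3191, -0.0077],
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.067469, -0.021274, -0.015978, 0.022634, 0.0],
    "group": ["group d"] * 5,
})
df2 = pd.DataFrame({
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.06701001, -0.02128, 0.0, 0.0, 0.0],
    "group": ["group c"] * 5,
}, index=range(23044, 23049))

# `lookup` (illustrative name) maps each (date, id) pair to its value1.
lookup = df1.set_index(["date", "id"])["value1"]

# Left-join on the (date, id) index: every df2 row picks up value1,
# while its own value2 and group stay untouched.
df3 = df2.set_index(["date", "id"]).join(lookup).reset_index()
print(df3)
```

Note that the join assumes each (date, id) pair occurs only once in df1; duplicated pairs would multiply the matching df2 rows.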

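To get the final result, you merge on all of the columns, so that rows are distinguished by every attribute and group and value2 are no longer lumped together.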

In [6]: pd.merge(df1, df3, how = 'outer', 
  ....:     on = ['date', 'id', 'value1', 'value2', 'group'])
Out[6]: 
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group_d
1 -0.4875  2012-04-01  b -0.021274  group_d
2  0.1139  2012-04-01  c -0.015978  group_d
3  0.3191  2012-04-01  d  0.022634  group_d
4 -0.0077  2012-04-01  e  0.000000  group_d
5 -0.2284  2012-04-01  a -0.067010  group_c
6 -0.4875  2012-04-01  b -0.021280  group_c
7  0.1139  2012-04-01  c  0.000000  group_c
8  0.3191  2012-04-01  d  0.000000  group_c
9 -0.0077  2012-04-01  e  0.000000  group_c
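As a variation on the two steps above, the whole thing can be wrapped in one function. This is only a sketch: `fill_and_combine` is a hypothetical helper, the sample frames are rebuilt from the question, and the second step uses `pd.concat` instead of the outer merge. Unlike the outer merge, `pd.concat` keeps rows that are identical in both frames, so add `.drop_duplicates()` if that matters for your data.

```python
import pandas as pd

def fill_and_combine(df1, df2, keys=("date", "id"), fill_col="value1"):
    """Hypothetical helper: copy fill_col from df1 into df2 by keys, then stack."""
    keys = list(keys)
    # Step 1: one fill_col value per key combination, left-merged onto df2.
    lookup = df1[keys + [fill_col]].drop_duplicates(subset=keys)
    filled = df2.merge(lookup, on=keys, how="left")
    # Step 2: both frames now have the same columns, so stack them
    # in df1's column order with a fresh 0..n-1 index.
    return pd.concat([df1, filled[df1.columns]], ignore_index=True)

# Sample frames rebuilt by hand from the question.
df1 = pd.DataFrame({
    "value1": [-0.2284, -0.4875, 0.1139, 0.3191, -0.0077],
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.067469, -0.021274, -0.015978, 0.022634, 0.0],
    "group": ["group d"] * 5,
})
df2 = pd.DataFrame({
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.06701001, -0.02128, 0.0, 0.0, 0.0],
    "group": ["group c"] * 5,
}, index=range(23044, 23049))

result = fill_and_combine(df1, df2)
print(result)
```

Keeping the key columns and the filled column as parameters makes it easy to reuse the same pattern when the matching columns have different names.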