Question

I have two DataFrames (dfA and dfB); an example of each is given below. I want to join them to produce the result shown.

dfA
Id, year, B, D
1,  2010, 15, 33
1,  2011, 24, 72
1,  2012, 30, 16

dfB
Id, year, A, C
1,  2009, 100, 1
1,  2010, 75, 7
1,  2012, 60, 3
1,  2013, 42, 4

Result
Id, year, A,   B,  C, D
1,  2009, 100, 0,  1, 0
1,  2010, 75,  15, 7, 33
1,  2011, 0,   24, 0, 72
1,  2012, 60,  30, 3, 16
1,  2013, 42,  0,  4, 0

Attempt

I have tried pandas.merge with inner, outer, left and right joins, but I cannot get the desired result!

result = pd.merge(dfA,dfB,on=['Id','year'], how = 'outer')

Any hints would be much appreciated!

Answer 1

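merge already gives the correct output; we just need to order the rows with sort_values (and fill the missing values with 0).

s = (pd.merge(dfA, dfB, on=['Id', 'year'], how='outer')
       .sort_index(level=0, axis=1)      # order the columns
       .sort_values(['Id', 'year'])      # order the rows
       .fillna(0))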
s
Out[81]: 
       A     B    C     D   year  Id
3  100.0   0.0  1.0   0.0   2009   1
0   75.0  15.0  7.0  33.0   2010   1
1    0.0  24.0  0.0  72.0   2011   1
2   60.0  30.0  3.0  16.0   2012   1
4   42.0   0.0  4.0   0.0   2013   1
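
For anyone who wants to reproduce this from scratch, here is a minimal sketch. The frames are typed in by hand from the question, and the `astype(int)`, `reset_index(drop=True)` and explicit column list are my own additions (assumptions about what the asker wants, namely integer columns in the order Id, year, A, B, C, D), not part of the answer above.

```python
import pandas as pd

# Sample frames copied by hand from the question.
dfA = pd.DataFrame({"Id": [1, 1, 1], "year": [2010, 2011, 2012],
                    "B": [15, 24, 30], "D": [33, 72, 16]})
dfB = pd.DataFrame({"Id": [1, 1, 1, 1], "year": [2009, 2010, 2012, 2013],
                    "A": [100, 75, 60, 42], "C": [1, 7, 3, 4]})

merged = (
    pd.merge(dfA, dfB, on=["Id", "year"], how="outer")
    .sort_values(["Id", "year"])   # put the rows in chronological order
    .fillna(0)                     # years missing from one frame become 0
    .astype(int)                   # the NaNs forced float columns; cast back
    .reset_index(drop=True)        # assumption: a fresh 0..n-1 index is wanted
)

# Assumption: the column order requested in the question.
merged = merged[["Id", "year", "A", "B", "C", "D"]]
print(merged)
```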

Answer 2

In this case an alternative to merge is pandas concat, concatenating along the column axis:

(pd.concat([dfA.set_index(['Id', 'year']),
            dfB.set_index(['Id', 'year'])], axis=1)
 .reset_index()
 .fillna(0)
 .reindex(columns=['Id', 'year', 'A', 'B', 'C', 'D'])
)

    Id  year    A       B   C   D
0   1,  2009,   100,    0   1.0 0.0
1   1,  2010,   75,     15, 7.0 33.0
2   1,  2011,   0       24, 0.0 72.0
3   1,  2012,   60,     30, 3.0 16.0
4   1,  2013,   42,     0   4.0 0.0
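
To see why the `fillna(0)` step is needed, it may help to look at the frame right after the concat. A rough sketch, assuming the sample data from the question typed in by hand; `keys`, `wide` and `missing` are just illustrative names. Passing `join="outer"` only spells out what I understand to be the default.

```python
import pandas as pd

# Sample frames copied by hand from the question.
dfA = pd.DataFrame({"Id": [1, 1, 1], "year": [2010, 2011, 2012],
                    "B": [15, 24, 30], "D": [33, 72, 16]})
dfB = pd.DataFrame({"Id": [1, 1, 1, 1], "year": [2009, 2010, 2012, 2013],
                    "A": [100, 75, 60, 42], "C": [1, 7, 3, 4]})

keys = ["Id", "year"]

# With axis=1 the frames are aligned on the union of their (Id, year) indexes.
wide = pd.concat([dfA.set_index(keys), dfB.set_index(keys)],
                 axis=1, join="outer")

# Count the holes per column: B and D are missing for 2009 and 2013,
# A and C are missing for 2011.
missing = wide.isna().sum()
print(missing)
```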

Answer 3

Since the Id and year columns are effectively being used as an index, it may make sense to turn them into one and use join:

dfA.set_index(['Id', 'year']).join(dfB.set_index(['Id', 'year']), how = 'outer'
              ).fillna(0).astype(int)[list('ABCD')].reset_index()

This gives:

   Id  year    A   B  C   D
0   1  2009  100   0  1   0
1   1  2010   75  15  7  33
2   1  2011    0  24  0  72
3   1  2012   60  30  3  16
4   1  2013   42   0  4   0

Answer 4

`fillna` with `downcast='infer'`

A cheeky way to order the columns

result = dfA.merge(dfB, 'outer').fillna(0, downcast='infer')
key = lambda x: (x not in {'Id', 'year'}, x)
result[sorted(result, key=key)]

   Id  year    A   B  C   D
0   1  2010   75  15  7  33
1   1  2011    0  24  0  72
2   1  2012   60  30  3  16
3   1  2009  100   0  1   0
4   1  2013   42   0  4   0
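
The sort key above is a bit terse, so here is a small plain-Python sketch of what it does. The `columns` list is my assumption of the order an outer merge of dfA and dfB returns; the key itself is the one from the answer.

```python
# Assumed column order coming out of the outer merge of dfA and dfB.
columns = ["Id", "year", "B", "D", "A", "C"]

# Tuples compare element by element and False sorts before True,
# so the key columns come first and the rest follow alphabetically.
key = lambda name: (name not in {"Id", "year"}, name)

ordered = sorted(columns, key=key)
print(ordered)  # ['Id', 'year', 'A', 'B', 'C', 'D']
```

Iterating over a DataFrame yields its column labels, which is why `sorted(result, key=key)` works directly on the frame.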

`stack` and `append`

I don't like this one; I'm only adding some color to the answer landscape.

dfA.set_index(['Id', 'year']).stack().append(
    dfB.set_index(['Id', 'year']).stack()
).unstack(fill_value=0).reset_index()

   Id  year    A   B  C   D
0   1  2009  100   0  1   0
1   1  2010   75  15  7  33
2   1  2011    0  24  0  72
3   1  2012   60  30  3  16
4   1  2013   42   0  4   0
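
As far as I know, `Series.append` was removed in pandas 2.0, so on a recent install the snippet above will likely raise an AttributeError. Here is a sketch of the same idea with `pd.concat` swapped in for `append`; the sample frames are typed in by hand from the question, and `long_a` / `long_b` are just illustrative names.

```python
import pandas as pd

# Sample frames copied by hand from the question.
dfA = pd.DataFrame({"Id": [1, 1, 1], "year": [2010, 2011, 2012],
                    "B": [15, 24, 30], "D": [33, 72, 16]})
dfB = pd.DataFrame({"Id": [1, 1, 1, 1], "year": [2009, 2010, 2012, 2013],
                    "A": [100, 75, 60, 42], "C": [1, 7, 3, 4]})

# Long format: one value per (Id, year, column name).
long_a = dfA.set_index(["Id", "year"]).stack()
long_b = dfB.set_index(["Id", "year"]).stack()

result = (
    pd.concat([long_a, long_b])   # stands in for long_a.append(long_b)
    .unstack(fill_value=0)        # back to wide; absent cells become 0
    .reset_index()
)
print(result)
```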

Answer 5

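merge produces the "correct result". However, the NaNs need to be filled and cast to int, and the columns need to be ordered. One way to get the right column order is the less-than-ideal approach of hard-coding it, which I sometimes find works a bit better than automatic ordering with sort_index(axis=1) or other methods.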

desired_col_order = ['Id', 'year', 'A', 'B', 'C', 'D']
(dfB.merge(dfA, on=['Id', 'year'], how='outer')
    .sort_values(['Id', 'year'])
    .fillna(0).astype(int)[desired_col_order])

This produces the desired result.
