I have two DataFrames (df1) and (df2), as shown below:
df1
--------------------------------
id date value1 value2
--------------------------------
12 2010-10-09 ABC Value44
13 2011-11-08 CDE Value66
14 2015-10-08 FGH Value41
13 2009-09-10 IJK Value39
14 2010-03-05 LMN Value29
15 2006-11-12 OPQ Value33
df2
--------------------------------
id date value3 value4
--------------------------------
12 2010-10-09 20 99
15 2006-11-12 50 66
16 2015-10-08 60 41
13 2011-11-08 30 39
15 2010-03-08 50 29
15 2006-11-12 50 33
16 2001-12-04 60 11
12 2009-06-10 20 21
17 2017-10-11 18 22
18 2016-11-11 23 87
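I would like to compare the two DataFrames and find the rows whose id and date columns match. If there is a match, the id, date, and the corresponding columns from both DataFrames should become a single row in a new DataFrame (result_df). If an (id, date) pair appears in only one of the DataFrames, that row should still be copied into result_df, with the columns from the other DataFrame left as NA.
In the end, result_df should look like this: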
result_df
--------------------------------------------
id date value1 value2 value3 value4
--------------------------------------------
12 2010-10-09 ABC Value44 20 99
12 2009-06-10 NA NA 20 21
13 2011-11-08 CDE Value66 30 39
13 2009-09-10 IJK Value39 NA NA
14 2015-10-08 FGH Value41 NA NA
14 2010-03-05 LMN Value29 NA NA
15 2006-11-12 OPQ Value33 50 66
15 2006-11-12 OPQ Value33 50 33
15 2010-03-08 NA NA 50 29
16 2015-10-08 NA NA 60 41
16 2001-12-04 NA NA 60 11
17 2017-10-11 NA NA 18 22
18 2016-11-11 NA NA 23 87
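I have tried .merge with inner and outer joins, but either it does not behave as I expect or I am not using it correctly. I thought a simple solution would be a for loop (the two DataFrames have only about 400 rows to compare), but I keep getting confused by the logic. Can anyone help me with this? Thanks!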
Answer 0 (score: 1)
I think you may be looking for an outer merge.
You can use pd.merge(..., how='outer') to get the desired result:
import pandas as pd
# 'id' is listed first so the column order matches the printed output
df1 = pd.DataFrame({'id': [12, 13, 14, 13, 14, 15], 'date': ['2010-10-09', '2011-11-08', '2015-10-08', '2009-09-10', '2010-03-05', '2006-11-12'], 'value1': ['ABC', 'CDE', 'FGH', 'IJK', 'LMN', 'OPQ'], 'value2': ['Value44', 'Value66', 'Value41', 'Value39', 'Value29', 'Value33']})
df2 = pd.DataFrame({'id': [12, 15, 16, 13, 15, 15, 16, 12, 17, 18], 'date': ['2010-10-09', '2006-11-12', '2015-10-08', '2011-11-08', '2010-03-08', '2006-11-12', '2001-12-04', '2009-06-10', '2017-10-11', '2016-11-11'], 'value3': [20, 50, 60, 30, 50, 50, 60, 20, 18, 23], 'value4': [99, 66, 41, 39, 29, 33, 11, 21, 22, 87]})
# With no `on` argument, merge joins on the columns the two frames share (id and date)
result = pd.merge(df1, df2, how='outer').sort_values(by='id')
print(result)
which yields
id date value1 value2 value3 value4
0 12 2010-10-09 ABC Value44 20.0 99.0
10 12 2009-06-10 NaN NaN 20.0 21.0
1 13 2011-11-08 CDE Value66 30.0 39.0
3 13 2009-09-10 IJK Value39 NaN NaN
2 14 2015-10-08 FGH Value41 NaN NaN
4 14 2010-03-05 LMN Value29 NaN NaN
5 15 2006-11-12 OPQ Value33 50.0 66.0
6 15 2006-11-12 OPQ Value33 50.0 33.0
8 15 2010-03-08 NaN NaN 50.0 29.0
7 16 2015-10-08 NaN NaN 60.0 41.0
9 16 2001-12-04 NaN NaN 60.0 11.0
11 17 2017-10-11 NaN NaN 18.0 22.0
12 18 2016-11-11 NaN NaN 23.0 87.0
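If it helps to see which rows matched and which came from only one side, here is a minimal sketch of the same outer merge with the join keys spelled out and indicator=True added. It assumes the same sample data as above. The conversion of value3 and value4 to the nullable 'Int64' dtype is an optional extra, not something the question asked for; it only avoids the 20.0 / 99.0 float display that NaN otherwise forces on integer columns.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [12, 13, 14, 13, 14, 15],
    'date': ['2010-10-09', '2011-11-08', '2015-10-08',
             '2009-09-10', '2010-03-05', '2006-11-12'],
    'value1': ['ABC', 'CDE', 'FGH', 'IJK', 'LMN', 'OPQ'],
    'value2': ['Value44', 'Value66', 'Value41', 'Value39', 'Value29', 'Value33'],
})
df2 = pd.DataFrame({
    'id': [12, 15, 16, 13, 15, 15, 16, 12, 17, 18],
    'date': ['2010-10-09', '2006-11-12', '2015-10-08', '2011-11-08', '2010-03-08',
             '2006-11-12', '2001-12-04', '2009-06-10', '2017-10-11', '2016-11-11'],
    'value3': [20, 50, 60, 30, 50, 50, 60, 20, 18, 23],
    'value4': [99, 66, 41, 39, 29, 33, 11, 21, 22, 87],
})

# Outer merge on the explicit key pair; indicator=True adds a `_merge`
# column with the values 'both', 'left_only' or 'right_only'.
result = pd.merge(df1, df2, on=['id', 'date'], how='outer', indicator=True)

# Optional: NaN forces value3/value4 to float; the nullable Int64 dtype
# keeps them as integers and shows missing entries as <NA>.
result = result.astype({'value3': 'Int64', 'value4': 'Int64'})

result = result.sort_values(by='id')
print(result)
```

Rows marked 'both' are the (id, date) pairs found in df1 and df2, 'left_only' rows exist only in df1, and 'right_only' rows exist only in df2. Because df2 contains (15, 2006-11-12) twice, that key produces two 'both' rows, which matches the two id 15 rows in the expected result_df.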