Create a new DataFrame by looping over and comparing columns from other DataFrames

Time: 2018-01-29 11:02:00

Tags: python-2.7 pandas

I have two DataFrames, df1 and df2, as shown below:

df1
--------------------------------
id     date    value1    value2
--------------------------------
12  2010-10-09  ABC     Value44
13  2011-11-08  CDE     Value66
14  2015-10-08  FGH     Value41
13  2009-09-10  IJK     Value39
14  2010-03-05  LMN     Value29
15  2006-11-12  OPQ     Value33

df2
--------------------------------
id     date    value3    value4
--------------------------------
12  2010-10-09  20        99
15  2006-11-12  50        66
16  2015-10-08  60        41
13  2011-11-08  30        39
15  2010-03-08  50        29
15  2006-11-12  50        33
16  2001-12-04  60        11
12  2009-06-10  20        21
17  2017-10-11  18        22
18  2016-11-11  23        87

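I want to compare the two DataFrames and find the rows where the id and date columns match. When there is a match, the id, the date, and the corresponding value columns from both DataFrames should become a single row in a new DataFrame (result_df). When an id and date pair appears in only one of the DataFrames, that row's columns should still be copied into result_df, with the columns from the other DataFrame left as NA.

In the end, result_df should look like this: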

result_df
--------------------------------------------
id     date   value1  value2   value3 value4
--------------------------------------------
12  2010-10-09  ABC  Value44      20    99
12  2009-06-10   NA     NA        20    21  
13  2011-11-08  CDE  Value66      30    39
13  2009-09-10  IJK  Value39      NA    NA
14  2015-10-08  FGH  Value41      NA    NA
14  2010-03-05  LMN  Value29      NA    NA
15  2006-11-12  OPQ  Value33      50    66
15  2006-11-12  OPQ  Value33      50    33
15  2010-03-08   NA    NA         50    29
16  2015-10-08   NA    NA         60    41
16  2001-12-04   NA    NA         60    11
17  2017-10-11   NA    NA         18    22
18  2016-11-11   NA    NA         23    87

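I have tried .merge with both inner and outer joins, but either it does not behave as expected or I am not using it correctly. I think the simple solution would be a for loop (the two DataFrames have only 400 rows to compare), but I cannot work out the logic. Can anyone help me with this? Thanks!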

1 Answer:

Answer 0: (score: 1)

I think you may be looking for an outer merge. You can get the desired result with pd.merge(..., how='outer'):

import pandas as pd
df1 = pd.DataFrame({'date': ['2010-10-09', '2011-11-08', '2015-10-08', '2009-09-10', '2010-03-05', '2006-11-12'], 'id': [12, 13, 14, 13, 14, 15], 'value1': ['ABC', 'CDE', 'FGH', 'IJK', 'LMN', 'OPQ'], 'value2': ['Value44', 'Value66', 'Value41', 'Value39', 'Value29', 'Value33']}) 
df2 = pd.DataFrame({'date': ['2010-10-09', '2006-11-12', '2015-10-08', '2011-11-08', '2010-03-08', '2006-11-12', '2001-12-04', '2009-06-10', '2017-10-11', '2016-11-11'], 'id': [12, 15, 16, 13, 15, 15, 16, 12, 17, 18], 'value3': [20, 50, 60, 30, 50, 50, 60, 20, 18, 23], 'value4': [99, 66, 41, 39, 29, 33, 11, 21, 22, 87]})
# With no `on` argument, merge joins on the columns the two frames share (here: id and date)
result = pd.merge(df1, df2, how='outer').sort_values(by='id')
print(result)

which yields

    id        date value1   value2  value3  value4
0   12  2010-10-09    ABC  Value44    20.0    99.0
10  12  2009-06-10    NaN      NaN    20.0    21.0 
1   13  2011-11-08    CDE  Value66    30.0    39.0
3   13  2009-09-10    IJK  Value39     NaN     NaN
2   14  2015-10-08    FGH  Value41     NaN     NaN
4   14  2010-03-05    LMN  Value29     NaN     NaN
5   15  2006-11-12    OPQ  Value33    50.0    66.0
6   15  2006-11-12    OPQ  Value33    50.0    33.0
8   15  2010-03-08    NaN      NaN    50.0    29.0
7   16  2015-10-08    NaN      NaN    60.0    41.0
9   16  2001-12-04    NaN      NaN    60.0    11.0
11  17  2017-10-11    NaN      NaN    18.0    22.0
12  18  2016-11-11    NaN      NaN    23.0    87.0
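
If the merge result is not what you expect, it can help to see which source each row came from. The sketch below is one possible way to check this, not part of the original answer: it rebuilds the same sample data, names the join keys explicitly with on=['id', 'date'], and passes indicator=True so that pandas adds a _merge column marking each row as both, left_only, or right_only. The variable names checked and counts are made up for this illustration.

```python
import pandas as pd

# Same sample data as in the answer above.
df1 = pd.DataFrame({
    'id': [12, 13, 14, 13, 14, 15],
    'date': ['2010-10-09', '2011-11-08', '2015-10-08',
             '2009-09-10', '2010-03-05', '2006-11-12'],
    'value1': ['ABC', 'CDE', 'FGH', 'IJK', 'LMN', 'OPQ'],
    'value2': ['Value44', 'Value66', 'Value41',
               'Value39', 'Value29', 'Value33'],
})
df2 = pd.DataFrame({
    'id': [12, 15, 16, 13, 15, 15, 16, 12, 17, 18],
    'date': ['2010-10-09', '2006-11-12', '2015-10-08', '2011-11-08',
             '2010-03-08', '2006-11-12', '2001-12-04', '2009-06-10',
             '2017-10-11', '2016-11-11'],
    'value3': [20, 50, 60, 30, 50, 50, 60, 20, 18, 23],
    'value4': [99, 66, 41, 39, 29, 33, 11, 21, 22, 87],
})

# Outer merge on explicit keys; indicator=True adds a `_merge` column
# recording whether each row matched in both frames or only in one.
checked = pd.merge(df1, df2, on=['id', 'date'], how='outer', indicator=True)

# Count how many rows fall into each category.
counts = checked['_merge'].value_counts()
print(counts)
```

Rows marked left_only or right_only are the ones whose columns from the other frame are filled with NaN. Those NaN values are also why value3 and value4 are shown as floats in the output above.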