Here are the links to the original dataset sources: dataset for capacity and dataset for type,
or the modified versions: dataset modified1 and dataset modified2.
I have two dataframes that I want to merge:
import pandas as pd
first_df=pd.DataFrame([['2001','Abu Dhabi','100-','462'],['2001','Abu Dhabi','100','44'],['2001','Abu Dhabi','200','657'],['2001','Dubai','100-','40'],['2001','Dubai','100','30'],['2001','Dubai','200','51'],['2002','Abu Dhabi','100-','300'],['2002','Abu Dhabi','100','220'],['2002','Abu Dhabi','200','56'],['2002','Dubai','100-','55'],['2002','Dubai','100','67'],['2002','Dubai','200','89']],columns=['Year','Emirate','Capacity','Number'])
second_df=pd.DataFrame([['2001','Abu Dhabi','Performed','45'],['2001','Abu Dhabi','Not Performed','76'],['2001','Dubai','Performed','90'],['2001','Dubai','Not Performed','50'],['2002','Abu Dhabi','Performed','78'],['2002','Abu Dhabi','Not Performed','45'],['2002','Dubai','Performed','76'],['2002','Dubai','Not Performed','58']],columns=['Year','Emirate','Type','Value'])
So I set a MultiIndex on both dataframes:
first=first_df.set_index(['Year','Emirate'])
second=second_df.set_index(['Year','Emirate'])
and merged them:
merged=first.merge(second,how='outer',right_index=True,left_index=True)
which gives the following result:
| Year , Emirate | Capacity | Number | Type | Value |
|:----------------------|:-----------|--------:|:--------------|--------:|
| ('2001', 'Abu Dhabi') | 100- | 462 | Performed | 45 |
| ('2001', 'Abu Dhabi') | 100- | 462 | Not Performed | 76 |
| ('2001', 'Abu Dhabi') | 100 | 44 | Performed | 45 |
| ('2001', 'Abu Dhabi') | 100 | 44 | Not Performed | 76 |
| ('2001', 'Abu Dhabi') | 200 | 657 | Performed | 45 |
| ('2001', 'Abu Dhabi') | 200 | 657 | Not Performed | 76 |
| ('2001', 'Dubai') | 100- | 40 | Performed | 90 |
| ('2001', 'Dubai') | 100- | 40 | Not Performed | 50 |
| ('2001', 'Dubai') | 100 | 30 | Performed | 90 |
| ('2001', 'Dubai') | 100 | 30 | Not Performed | 50 |
| ('2001', 'Dubai') | 200 | 51 | Performed | 90 |
| ('2001', 'Dubai') | 200 | 51 | Not Performed | 50 |
| ('2002', 'Abu Dhabi') | 100- | 300 | Performed | 78 |
| ('2002', 'Abu Dhabi') | 100- | 300 | Not Performed | 45 |
| ('2002', 'Abu Dhabi') | 100 | 220 | Performed | 78 |
| ('2002', 'Abu Dhabi') | 100 | 220 | Not Performed | 45 |
| ('2002', 'Abu Dhabi') | 200 | 56 | Performed | 78 |
| ('2002', 'Abu Dhabi') | 200 | 56 | Not Performed | 45 |
| ('2002', 'Dubai') | 100- | 55 | Performed | 76 |
| ('2002', 'Dubai') | 100- | 55 | Not Performed | 58 |
| ('2002', 'Dubai') | 100 | 67 | Performed | 76 |
| ('2002', 'Dubai') | 100 | 67 | Not Performed | 58 |
| ('2002', 'Dubai') | 200 | 89 | Performed | 76 |
| ('2002', 'Dubai') | 200 | 89 | Not Performed | 58 |
I also tried concatenating them, which gives the following result:
joined=pd.concat([first,second])
| Year , Emirate | Capacity | Number | Type | Value |
|:----------------------|:-----------|---------:|:--------------|--------:|
| ('2001', 'Abu Dhabi') | 100- | 462 | nan | nan |
| ('2001', 'Abu Dhabi') | 100 | 44 | nan | nan |
| ('2001', 'Abu Dhabi') | 200 | 657 | nan | nan |
| ('2001', 'Dubai') | 100- | 40 | nan | nan |
| ('2001', 'Dubai') | 100 | 30 | nan | nan |
| ('2001', 'Dubai') | 200 | 51 | nan | nan |
| ('2002', 'Abu Dhabi') | 100- | 300 | nan | nan |
| ('2002', 'Abu Dhabi') | 100 | 220 | nan | nan |
| ('2002', 'Abu Dhabi') | 200 | 56 | nan | nan |
| ('2002', 'Dubai') | 100- | 55 | nan | nan |
| ('2002', 'Dubai') | 100 | 67 | nan | nan |
| ('2002', 'Dubai') | 200 | 89 | nan | nan |
| ('2001', 'Abu Dhabi') | nan | nan | Performed | 45 |
| ('2001', 'Abu Dhabi') | nan | nan | Not Performed | 76 |
| ('2001', 'Dubai') | nan | nan | Performed | 90 |
| ('2001', 'Dubai') | nan | nan | Not Performed | 50 |
| ('2002', 'Abu Dhabi') | nan | nan | Performed | 78 |
| ('2002', 'Abu Dhabi') | nan | nan | Not Performed | 45 |
| ('2002', 'Dubai') | nan | nan | Performed | 76 |
| ('2002', 'Dubai') | nan | nan | Not Performed | 58 |
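So the joined dataframes should have neither the duplicated rows (as in the merge) nor the shifted-down rows (as in the concat variant). How can I get the two dataframes to align properly?
Here is what the desired output should look like: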
| | Year | Emirate | Capacity | Number | Type | Value |
|---:|-------:|:----------|:-----------|---------:|:--------------|--------:|
| 0 | | | 100- | 462 | Performed | 45 |
| 1 | | Abu Dhabi | 100 | 44 | Not Performed | 76 |
| 2 | | | 200 | 657 | NaN | nan |
| 3 | 2001 | | 100- | 40 | Performed | 90 |
| 4 | | Dubai | 100 | 30 | Not Performed | 50 |
| 5 | | | 200 | 51 | NaN | nan |
| 6 | | | 100- | 300 | Performed | 78 |
| 7 | | Abu Dhabi | 100 | 220 | Not Performed | 45 |
| 8 | 2002 | | 200 | 56 | NaN | nan |
| 9 | | | 100- | 55 | Performed | 76 |
| 10 | | Dubai | 100 | 67 | Not Performed | 58 |
| 11 | | | 200 | 89 | NaN | nan |
Answer 0 (score: 0)
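I see the problem here: when you join on ['Year','Emirate'], your data produces what is effectively a cross join within each key. For example, '2001, Abu Dhabi' appears in both dataframes, and in second_df it carries both 'Performed' and 'Not Performed', so every matching row of first_df is paired with each of them. This is an m x n relationship between the two datasets. Unless you specify a key that uniquely identifies each row, you will keep getting the same result.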
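To make the m x n effect concrete, here is a minimal sketch on a single key taken from the question's data. It counts the rows per key on each side and compares the product with the size of the merged result. The `validate` argument of `merge` is used only as a check that the keys are not unique; the variable names are illustrative.

```python
import pandas as pd

# One key group from the question: 3 rows on the left, 2 on the right.
first_df = pd.DataFrame(
    [['2001', 'Abu Dhabi', '100-', '462'],
     ['2001', 'Abu Dhabi', '100', '44'],
     ['2001', 'Abu Dhabi', '200', '657']],
    columns=['Year', 'Emirate', 'Capacity', 'Number'])
second_df = pd.DataFrame(
    [['2001', 'Abu Dhabi', 'Performed', '45'],
     ['2001', 'Abu Dhabi', 'Not Performed', '76']],
    columns=['Year', 'Emirate', 'Type', 'Value'])

keys = ['Year', 'Emirate']

# Number of rows sharing each key on either side.
left_counts = first_df.groupby(keys).size()
right_counts = second_df.groupby(keys).size()

# A merge on a non-unique key emits every left/right pairing per key.
merged = first_df.merge(second_df, how='outer', on=keys)
expected_rows = int((left_counts * right_counts).sum())
print(len(merged), expected_rows)

# validate= asks pandas to check key uniqueness instead of silently
# multiplying rows; here it raises because the keys repeat on both sides.
try:
    first_df.merge(second_df, how='outer', on=keys, validate='one_to_one')
    keys_are_unique = True
except pd.errors.MergeError:
    keys_are_unique = False
print(keys_are_unique)
```

With 3 left rows and 2 right rows for the same key, the merge returns 3 x 2 = 6 rows, which is the duplication seen in the question.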
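Answer 1 (score: 0)
I think your data is not quite right yet: the expected output is achievable, but not with the data as it currently stands.
You are missing a third key column in second_df, namely Capacity. If we add this column and perform a left merge, we get your expected output.
By the way, we don't need to set the columns as the index, so the solution looks like this.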
import pandas as pd

# Clean up and create correct dataframes
first_df=pd.DataFrame([['2001','Abu Dhabi','100-','462'],
['2001','Abu Dhabi','100','44'],
['2001','Abu Dhabi','200','657'],
['2001','Dubai','100-','40'],
['2001','Dubai','100','30'],
['2001','Dubai','200','51'],
['2002','Abu Dhabi','100-','300'],
['2002','Abu Dhabi','100','220'],
['2002','Abu Dhabi','200','56'],
['2002','Dubai','100-','55'],
['2002','Dubai','100','67'],
['2002','Dubai','200','89']],columns=['Year','Emirate','Capacity','Number'])
second_df=pd.DataFrame([['2001','Abu Dhabi','100-','Performed','45'],
['2001','Abu Dhabi','100','Not Performed','76'],
['2001','Abu Dhabi','','',''],
['2001','Dubai','100-','Performed','90'],
['2001','Dubai','100','Not Performed','50'],
['2001','Dubai','','',''],
['2002','Abu Dhabi','100-','Performed','78'],
['2002','Abu Dhabi','100','Not Performed','45'],
['2002','Abu Dhabi','', '', ''],
['2002','Dubai','100-','Performed','76'],
['2002','Dubai','100','Not Performed','58'],
['2002','Dubai', '', '', '']],columns=['Year','Emirate','Capacity','Type','Value'])
# Perform a left merge to get correct output
merged=first_df.merge(second_df,how='left',on=['Year', 'Emirate', 'Capacity'])
Output
Year Emirate Capacity Number Type Value
0 2001 Abu Dhabi 100- 462 Performed 45
1 2001 Abu Dhabi 100 44 Not Performed 76
2 2001 Abu Dhabi 200 657 NaN NaN
3 2001 Dubai 100- 40 Performed 90
4 2001 Dubai 100 30 Not Performed 50
5 2001 Dubai 200 51 NaN NaN
6 2002 Abu Dhabi 100- 300 Performed 78
7 2002 Abu Dhabi 100 220 Not Performed 45
8 2002 Abu Dhabi 200 56 NaN NaN
9 2002 Dubai 100- 55 Performed 76
10 2002 Dubai 100 67 Not Performed 58
11 2002 Dubai 200 89 NaN NaN
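If editing second_df by hand is not practical, one possible alternative is to derive the extra key from row position. The sketch below is not the approach above but a variation on it: it assumes that rows should be paired by their order within each (Year, Emirate) group, which is what the desired output suggests but is not stated in the question. The helper column `pos` is a hypothetical name, and the data is cut down to 2001 for brevity.

```python
import pandas as pd

first_df = pd.DataFrame(
    [['2001', 'Abu Dhabi', '100-', '462'],
     ['2001', 'Abu Dhabi', '100', '44'],
     ['2001', 'Abu Dhabi', '200', '657'],
     ['2001', 'Dubai', '100-', '40'],
     ['2001', 'Dubai', '100', '30'],
     ['2001', 'Dubai', '200', '51']],
    columns=['Year', 'Emirate', 'Capacity', 'Number'])
second_df = pd.DataFrame(
    [['2001', 'Abu Dhabi', 'Performed', '45'],
     ['2001', 'Abu Dhabi', 'Not Performed', '76'],
     ['2001', 'Dubai', 'Performed', '90'],
     ['2001', 'Dubai', 'Not Performed', '50']],
    columns=['Year', 'Emirate', 'Type', 'Value'])

keys = ['Year', 'Emirate']

# 'pos' (hypothetical helper column) numbers the rows 0, 1, 2, ...
# within each (Year, Emirate) group, in their existing order.
first_keyed = first_df.assign(pos=first_df.groupby(keys).cumcount())
second_keyed = second_df.assign(pos=second_df.groupby(keys).cumcount())

# With 'pos' added, each row has a unique key, so the left merge pairs
# rows one to one and leaves NaN where second_df has no counterpart.
merged = (first_keyed
          .merge(second_keyed, how='left', on=keys + ['pos'])
          .drop(columns='pos'))
print(merged)
```

This only gives a meaningful result if the row order in both dataframes really reflects the intended pairing; if Type is actually tied to a specific Capacity, adding that column explicitly, as done above, is the safer choice.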