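How do I merge two pandas DataFrames on two different columns whose elements are not in the same order?

Time: 2018-12-18 01:09:07

Tags: python pandas

I have two datasets. The first looks like this: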

          name  Longitude   Latitude      continent
0        Aruba -69.982677  12.520880  North America
1  Afghanistan  66.004734  33.835231           Asia
2       Angola  17.537368 -12.293361         Africa
3     Anguilla -63.064989  18.223959  North America
4      Albania  20.049834  41.142450         Europe

The other dataset looks like this:

          COUNTRY  GDP (BILLIONS) CODE
0     Afghanistan           21.71  AFG
1         Albania           13.40  ALB
2         Algeria          227.80  DZA
3  American Samoa            0.75  ASM
4         Andorra            4.80  AND

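Here, the columns name and COUNTRY both contain country names, but in a different order.

How can I merge the second DataFrame into the first one so that the CODE column is added to the first DataFrame?

Required output: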

          name  Longitude   Latitude      continent   CODE
0        Aruba -69.982677  12.520880  North America   NaN
1  Afghanistan  66.004734  33.835231           Asia   AFG
2       Angola  17.537368 -12.293361         Africa   NaN
3     Anguilla -63.064989  18.223959  North America   NaN
4      Albania  20.049834  41.142450         Europe   ALB

Attempt:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name' : ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
          'Longitude' : [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
          'Latitude' : [12.520880, 33.835231, -12.293361, 18.223959, 41.142450],
          'continent' : ['North America','Asia','Africa','North America','Europe'] })
print(df)

df2 = pd.DataFrame({'COUNTRY' :  ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
          'GDP (BILLIONS)' : [21.71, 13.40, 227.80, 0.75, 4.80],
          'CODE' : ['AFG', 'ALB', 'DZA', 'ASM', 'AND']})
print(df2)


pd.merge(left=df, right=df2,left_on='name',right_on='COUNTRY')
# but this fails: it returns only the rows whose name appears in both DataFrames

3 answers:

Answer 0 (score: 2)

By default, pd.merge uses how='inner', which uses the intersection of the keys in both DataFrames. Here you need how='left' so that only the keys from the left DataFrame are used:

res = pd.merge(df, df2, how='left', left_on='name', right_on='COUNTRY')
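
To check which rows of the left DataFrame found no partner in a left merge, one option is the indicator parameter of pd.merge. The following is a minimal sketch on a trimmed copy of the question's data; the variable names checked and unmatched are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
    'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe'],
})
df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND'],
})

# indicator=True adds a '_merge' column holding 'both', 'left_only' or 'right_only'
checked = pd.merge(df, df2, how='left', left_on='name', right_on='COUNTRY',
                   indicator=True)

# Names from the left DataFrame that had no match in df2
unmatched = checked.loc[checked['_merge'] == 'left_only', 'name'].tolist()
print(unmatched)
```

With a left merge, rows marked 'left_only' are exactly the ones whose CODE ends up as NaN.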

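Answer 1 (score: 1)

By default, merge performs an "inner" merge or join, which keeps only the records that have a match on both the left and the right. You need an "outer" join, which keeps all records (there are also "left" and "right" joins).

Example: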

import pandas as pd

df1 = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
    'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
    'Latitude': [12.520880, 33.835231, -12.293361, 18.223959, 41.142450],
    'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe']
})
print(df1)

df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND']
})
print(df2)

# merge, using 'outer' to avoid losing records from either left or right
df3 = pd.merge(left=df1, right=df2, left_on='name', right_on='COUNTRY', how='outer')
# combining the columns used to match
df3['name'] = df3['name'].fillna(df3['COUNTRY'])
# dropping the now spare column
df3 = df3.drop('COUNTRY', axis=1)
print(df3)
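
To see how the four values of how differ on this data, a small sketch can count the rows each one returns. The frame names left and right and the dictionary row_counts are made up for illustration, and only the key columns are kept.

```python
import pandas as pd

left = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
})
right = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND'],
})

# Count the rows produced by each join type
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, left_on='name', right_on='COUNTRY', how=how)
    row_counts[how] = len(merged)

print(row_counts)
```

Only two names appear in both frames, so the inner join returns 2 rows, the left and right joins return 5 each, and the outer join returns 8 (the 2 matches plus 3 unmatched rows from each side).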

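Answer 2 (score: 1)

pandas has the pd.merge [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html] function, which uses an inner join by default. When the keys being merged on are not identical in the two DataFrames, an inner join keeps only the values that are present in both of the key columns given in on, or in left_on and right_on.

Since you need to add the CODE values, you can use the following line of code: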

pd.merge(left=df, right=df2[['COUNTRY', 'CODE']], left_on='name', right_on='COUNTRY', how='left')

This gives the following output:

          name  Longitude   Latitude      continent      COUNTRY CODE
0        Aruba -69.982677  12.520880  North America          NaN  NaN
1  Afghanistan  66.004734  33.835231           Asia  Afghanistan  AFG
2       Angola  17.537368 -12.293361         Africa          NaN  NaN
3     Anguilla -63.064989  18.223959  North America          NaN  NaN
4      Albania  20.049834  41.142450         Europe      Albania  ALB
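
The output above still carries a redundant COUNTRY column. One possible way to avoid it, sketched here under the assumption that renaming the key column is acceptable, is to rename COUNTRY to name before merging so that both frames share a single key. The variable names codes and res are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
    'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
    'Latitude': [12.520880, 33.835231, -12.293361, 18.223959, 41.142450],
    'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe'],
})
df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND'],
})

# Keep only the key and the wanted column, and give the key the same name as in df
codes = df2[['COUNTRY', 'CODE']].rename(columns={'COUNTRY': 'name'})

# With a shared key name, the merge produces one 'name' column instead of two
res = df.merge(codes, on='name', how='left')
print(res)
```

Dropping the extra column afterwards with res.drop(columns='COUNTRY') on the original merge result would be an equivalent route.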

The following gives the same data, with the columns and rows in a different order:

new_df = pd.merge(left=df2[['COUNTRY', 'CODE']], right=df, left_on='COUNTRY', right_on='name', how='right')

       COUNTRY CODE         name  Longitude   Latitude      continent
0  Afghanistan  AFG  Afghanistan  66.004734  33.835231           Asia
1      Albania  ALB      Albania  20.049834  41.142450         Europe
2          NaN  NaN        Aruba -69.982677  12.520880  North America
3          NaN  NaN       Angola  17.537368 -12.293361         Africa
4          NaN  NaN     Anguilla -63.064989  18.223959  North America
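
When only a single column has to be brought over, a lookup with Series.map is a possible alternative to merge. This is a different technique from the ones in the answers above, and the sketch assumes that each country appears only once in df2. The name code_by_country is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
    'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe'],
})
df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND'],
})

# Build a Series that maps each country name to its code (assumes unique names)
code_by_country = df2.set_index('COUNTRY')['CODE']

# Look up each name; names missing from df2 become NaN
df['CODE'] = df['name'].map(code_by_country)
print(df)
```

This keeps the row order and column set of df unchanged apart from the new CODE column.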