I have two datasets. The first looks like this:
name Longitude Latitude continent
0 Aruba -69.982677 12.520880 North America
1 Afghanistan 66.004734 33.835231 Asia
2 Angola 17.537368 -12.293361 Africa
3 Anguilla -63.064989 18.223959 North America
4 Albania 20.049834 41.142450 Europe
The other dataset looks like this:
COUNTRY GDP (BILLIONS) CODE
0 Afghanistan 21.71 AFG
1 Albania 13.40 ALB
2 Algeria 227.80 DZA
3 American Samoa 0.75 ASM
4 Andorra 4.80 AND
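Here, the columns name and COUNTRY both contain country names, but in a different order.
How can I merge the second dataframe into the first one so that the CODE column is added to the first dataframe?
Required output: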
name Longitude Latitude continent CODE
0 Aruba -69.982677 12.520880 North America NaN
1 Afghanistan 66.004734 33.835231 Asia AFG
2 Angola 17.537368 -12.293361 Africa NaN
3 Anguilla -63.064989 18.223959 North America NaN
4 Albania 20.049834 41.142450 Europe ALB
Attempt:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name' : ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
'Longitude' : [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
'Latitude' : [12.520880, 33.835231, -12.293361, 18.223959, 41.142450],
'continent' : ['North America','Asia','Africa','North America','Europe'] })
print(df)
df2 = pd.DataFrame({'COUNTRY' : ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
'GDP (BILLIONS)' : [21.71, 13.40, 227.80, 0.75, 4.80],
'CODE' : ['AFG', 'ALB', 'DZA', 'ASM', 'AND']})
print(df2)
pd.merge(left=df, right=df2,left_on='name',right_on='COUNTRY')
# but this fails
Answer 0 (score: 2)
By default, pd.merge uses how='inner', which uses the intersection of the keys from both dataframes. Here you need how='left' to use only the keys from the left dataframe:
res = pd.merge(df, df2, how='left', left_on='name', right_on='COUNTRY')
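A left merge on the full second dataframe also carries over the COUNTRY and GDP (BILLIONS) columns, which the required output does not include. The following is a minimal sketch, assuming the dataframes from the question, that drops those extra columns afterwards; the variable name res is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
    'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
    'Latitude': [12.520880, 33.835231, -12.293361, 18.223959, 41.142450],
    'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe'],
})
df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND'],
})

# Keep every row of df; rows without a match in df2 get NaN in the new columns.
res = pd.merge(df, df2, how='left', left_on='name', right_on='COUNTRY')

# Remove the columns from df2 that are not wanted in the final result.
res = res.drop(columns=['COUNTRY', 'GDP (BILLIONS)'])
print(res)
```

The rows stay in the order of the left dataframe, so Aruba, Angola and Anguilla end up with NaN in CODE.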
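Answer 1 (score: 1)
By default, merge performs an "inner" merge or join, which keeps only the records that match on both the left and the right side. You want an "outer" join, which keeps all records (there are also "left" and "right" joins).
Example: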
import pandas as pd
df1 = pd.DataFrame({
'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
'Latitude': [12.520880, 33.835231, -12.293361, 18.223959, 41.142450],
'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe']
})
print(df1)
df2 = pd.DataFrame({
'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND']
})
print(df2)
# merge, using 'outer' to avoid losing records from either left or right
df3 = pd.merge(left=df1, right=df2, left_on='name', right_on='COUNTRY', how='outer')
# combining the columns used to match
df3['name'] = df3['name'].fillna(df3['COUNTRY'])
# dropping the now spare column
df3 = df3.drop('COUNTRY', axis=1)
print(df3)
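To see how the join types differ on this data, it can help to count the rows each one returns. The sketch below assumes the same country names as the question (other columns are left out for brevity) and uses the indicator=True option of pd.merge, which adds a _merge column showing where each row came from. The names row_counts and outer are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
})
df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND'],
})

# Count the rows produced by each join type.
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(df, df2, left_on='name', right_on='COUNTRY', how=how)
    row_counts[how] = len(merged)
print(row_counts)  # {'inner': 2, 'left': 5, 'right': 5, 'outer': 8}

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
outer = pd.merge(df, df2, left_on='name', right_on='COUNTRY',
                 how='outer', indicator=True)
print(outer[['name', 'COUNTRY', '_merge']])
```

Only Afghanistan and Albania appear in both frames, so the outer join has 8 rows while the left join keeps exactly the 5 rows of the first dataframe.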
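Answer 2 (score: 1)
pandas has the pd.merge [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html] function, which uses an inner join by default. When the keys being merged on are not identical in both dataframes, an inner join keeps only the values that are present in both of the key columns specified in on, or in left_on and right_on.
Since you need to add the CODE values, you can use the following line of code: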
pd.merge(left=df, right=df2[['COUNTRY', 'CODE']], left_on='name', right_on='COUNTRY', how='left')
This gives the following output:
name Longitude Latitude continent COUNTRY CODE
0 Aruba -69.982677 12.520880 North America NaN NaN
1 Afghanistan 66.004734 33.835231 Asia Afghanistan AFG
2 Angola 17.537368 -12.293361 Africa NaN NaN
3 Anguilla -63.064989 18.223959 North America NaN NaN
4 Albania 20.049834 41.142450 Europe Albania ALB
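The output above still contains a redundant COUNTRY column, because the key columns have different names. One way to avoid that, sketched here under the assumption that the dataframes are the ones from the question, is to rename COUNTRY to name before merging so that both frames share a single key column. The names codes and res are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
    'Longitude': [-69.982677, 66.004734, 17.537368, -63.064989, 20.049834],
    'Latitude': [12.520880, 33.835231, -12.293361, 18.223959, 41.142450],
    'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe'],
})
df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'GDP (BILLIONS)': [21.71, 13.40, 227.80, 0.75, 4.80],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND'],
})

# Select only the needed columns and give the key the same name as in df.
codes = df2[['COUNTRY', 'CODE']].rename(columns={'COUNTRY': 'name'})

# With a shared key name, on='name' yields a single key column in the result.
res = df.merge(codes, on='name', how='left')
print(res)
```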
The following gives the same result as well:
new_df = pd.merge(left=df2[['COUNTRY', 'CODE']], right=df, left_on='COUNTRY', right_on='name', how='right')
COUNTRY CODE name Longitude Latitude continent
0 Afghanistan AFG Afghanistan 66.004734 33.835231 Asia
1 Albania ALB Albania 20.049834 41.142450 Europe
2 NaN NaN Aruba -69.982677 12.520880 North America
3 NaN NaN Angola 17.537368 -12.293361 Africa
4 NaN NaN Anguilla -63.064989 18.223959 North America
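Since only a single column is being added, a lookup is another option besides merging. This is not from the answers above; it is a minimal sketch that assumes the COUNTRY values in the second dataframe are unique, and it uses Series.map with a Series indexed by country name. The name code_by_country is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Albania'],
    'continent': ['North America', 'Asia', 'Africa', 'North America', 'Europe'],
})
df2 = pd.DataFrame({
    'COUNTRY': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'CODE': ['AFG', 'ALB', 'DZA', 'ASM', 'AND'],
})

# Build a lookup Series: index is the country name, values are the codes.
# Assumption: each country appears at most once in df2.
code_by_country = df2.set_index('COUNTRY')['CODE']

# Names without an entry in the lookup become NaN.
df['CODE'] = df['name'].map(code_by_country)
print(df)
```

This keeps the row order and index of the first dataframe and adds no extra key column.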