I am trying to merge the table at https://www.cia.gov/library/publications/the-world-factbook/fields/2127.html with the one at https://www.cia.gov/library/publications/the-world-factbook/rankorder/2004rank.html.
To create the two DataFrames, I do the following:
import pandas as pd

url = 'https://www.cia.gov/library/publications/the-world-factbook/fields/2127.html'
url2 = 'https://www.cia.gov/library/publications/the-world-factbook/rankorder/2004rank.html'
# Map the long column headers to short names
d = {'TOTAL FERTILITY RATE(CHILDREN BORN/WOMAN)':'TFR'}
d2 = {'GDP - PER CAPITA (PPP)':'GDP (PPP)'}
df = pd.read_html(url, header=0)[0].rename(columns=d)
df2 = pd.read_html(url2, header=0)[0].rename(columns=d2)
# Drop the last 31 characters of each value before converting to a number
df['TFR'] = pd.to_numeric(df['TFR'].str[:-31])
Now I create a sub-DataFrame from df2:
df21 = df2[['Country','GDP (PPP)']]
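So I end up with df21, which contains the country names and their GDP. Now I want to compare the two DataFrames and assign each country in df its GDP (PPP) value based on its name (both df and df2 have a column with the country names). Any ideas on how to do this?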
Answer 0 (score: 1)
Use merge with a left join:
df3 = df.merge(df2[['Country','GDP (PPP)']], on='Country', how='left')
print(df3.head())
          Country   TFR  GDP (PPP)
0     Afghanistan  5.12     $2,000
1         Albania  1.51    $12,500
2         Algeria  2.70    $15,200
3  American Samoa  2.68    $11,200
4         Andorra  1.40    $49,900
Or use map:
df['GDP (PPP)'] = df['Country'].map(df2.set_index('Country')['GDP (PPP)'])
print(df.head())
          Country   TFR  GDP (PPP)
0     Afghanistan  5.12     $2,000
1         Albania  1.51    $12,500
2         Algeria  2.70    $15,200
3  American Samoa  2.68    $11,200
4         Andorra  1.40    $49,900
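The GDP (PPP) values in this output are strings such as $2,000, so they cannot be sorted or compared numerically as they stand. A minimal sketch of one way to convert them follows. It assumes the values always look like a dollar sign followed by comma-separated digits; the small frame is a stand-in copied from the output shown here, not data fetched from the site, and the column name 'GDP (PPP) num' is made up for illustration.

```python
import pandas as pd

# Stand-in frame copied from the printed output (hypothetical sample, not fetched).
df = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania', 'Algeria'],
    'TFR': [5.12, 1.51, 2.70],
    'GDP (PPP)': ['$2,000', '$12,500', '$15,200'],
})

# Remove the dollar sign and thousands separators, then convert to numbers.
# errors='coerce' turns missing or unparseable values into NaN instead of raising.
df['GDP (PPP) num'] = pd.to_numeric(
    df['GDP (PPP)'].str.replace(r'[$,]', '', regex=True),
    errors='coerce',
)
print(df)
```

Keeping the original string column alongside the numeric one makes it easy to check the conversion by eye before dropping it.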
If a country in df['Country'] does not exist in df2['Country'], its GDP (PPP) value is NaN:
print(df[df['GDP (PPP)'].isna()])
                     Country   TFR  GDP (PPP)
43          Christmas Island   NaN        NaN
44   Cocos (Keeling) Islands   NaN        NaN
78                Gaza Strip  4.13        NaN
154           Norfolk Island   NaN        NaN
165         Pitcairn Islands   NaN        NaN
191                  Somalia  5.80        NaN
198                 Svalbard   NaN        NaN
230                    World  2.42        NaN
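Another way to find the countries without a match is the indicator parameter of merge, which adds a _merge column recording where each row came from. The sketch below uses two small made-up frames (left and right are hypothetical names, with values borrowed from the outputs in this answer) rather than the real tables.

```python
import pandas as pd

# Hypothetical miniature versions of df and df2, for illustration only.
left = pd.DataFrame({'Country': ['Albania', 'Somalia', 'World'],
                     'TFR': [1.51, 5.80, 2.42]})
right = pd.DataFrame({'Country': ['Albania', 'Andorra'],
                      'GDP (PPP)': ['$12,500', '$49,900']})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'.
merged = left.merge(right, on='Country', how='left', indicator=True)

# Rows marked 'left_only' are countries that had no GDP entry to join.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```

Unlike checking for NaN afterwards, this separates "no matching country" from "country matched but its GDP value was already missing".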