I have two tables in Python (as DataFrames). The first looks like this:
Country Year totmigrants
Afghanistan 2000
Afghanistan 2001
Afghanistan 2002
Afghanistan 2003
Afghanistan 2004
Afghanistan 2005
Afghanistan 2006
Afghanistan 2007
Afghanistan 2008
Algeria 2000
Algeria 2001
Algeria 2002
...
Zimbabwe 2008
The other exists once per year (9 separate DataFrames in total, covering 2000-2008):
Year=2000
---------------------------------------
Country totmigrants Gender Total
Afghanistan 73 M 70
Afghanistan F 3
Albania 11 M 5
Albania F 6
Algeria 52 M 44
...
Zimbabwe F 1
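I want to join them so that every row of the first table is kept (an outer join on the first table). I had this idea, but it only works for merging on columns: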
new=pd.merge(table1,table2,how='left',on=['Country', 'Year'])
What I want to see is the yearly migrant totals from each per-year dataframe, with F and M appearing as new columns in the first table:
Country Year totmigrants F M
Afghanistan 2000 73 3 70
Afghanistan 2001 table3
Afghanistan 2002 table4
Afghanistan 2003 ...
Afghanistan 2004
Afghanistan 2005
Afghanistan 2006
Afghanistan 2007
Afghanistan 2008
Algeria 2000 52 8 44
Algeria 2001 table3 ...
Algeria 2002 table4 ...
...
Zimbabwe 2008 ... ...
Is there a specific method for this kind of merge, or which function do I need to use?
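Answer 0 (score: 1)
Here is how to combine the data from the yearly dataframes. Let's assume the yearly dataframes are stored in a dictionary, keyed by year: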
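import pandas as pd

# Schematic only: each value is the dataframe for that year
df = {2000: ..., 2001: ..., ..., 2008: ...}

yearly = []
for N in df.keys():
    # One row per country, one column per gender (F, M)
    tmp = df[N].pivot(index='Country', columns='Gender', values='Total').fillna(0).astype(int)
    tmp['Year'] = N  # Store the year
    tmp['totmigrants'] = tmp['M'] + tmp['F']
    yearly.append(tmp)
df = pd.concat(yearly)
print(df)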
#Gender F M Year totmigrants
#Country
#Afghanistan 3 70 2000 73
#Albania 6 5 2000 11
#Algeria 0 44 2000 44
#Zimbabwe 1 0 2000 1
Now you can merge df with the first dataframe, using ['Country','Year'] as the keys.
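As a minimal sketch of that final merge step, the example below rebuilds the yearly summary and joins it onto the first table. The names `table1`, `frames`, and `combined` are illustrative, the sample values are made up to resemble the tables in the question, and `table1` is assumed to hold only `Country` and `Year` (an existing empty `totmigrants` column would otherwise come back from the merge with `_x`/`_y` suffixes).

```python
import pandas as pd

# Hypothetical stand-in for the first table (Country/Year pairs only).
table1 = pd.DataFrame({
    'Country': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
    'Year': [2000, 2001, 2000, 2001],
})

# Hypothetical yearly tables, keyed by year (values are illustrative).
frames = {
    2000: pd.DataFrame({
        'Country': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
        'Gender': ['M', 'F', 'M', 'F'],
        'Total': [70, 3, 5, 6],
    }),
    2001: pd.DataFrame({
        'Country': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
        'Gender': ['M', 'F', 'M', 'F'],
        'Total': [60, 15, 11, 4],
    }),
}

yearly = []
for year, frame in frames.items():
    # Reshape to one row per country with F and M as columns.
    wide = frame.pivot(index='Country', columns='Gender', values='Total').fillna(0).astype(int)
    wide['Year'] = year
    wide['totmigrants'] = wide['M'] + wide['F']
    yearly.append(wide)

# Country is the index after pivot; reset_index makes it an ordinary column.
combined = pd.concat(yearly).reset_index()
combined.columns.name = None  # drop the leftover 'Gender' label on the columns

# Left join: keep every (Country, Year) row of table1.
merged = pd.merge(table1, combined, how='left', on=['Country', 'Year'])
print(merged)
```

With `how='left'`, every row of `table1` is kept, and any (Country, Year) pair that has no yearly data gets NaN in the new columns.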
Answer 1 (score: 1)
I'm not sure you need the first table. I did the following; I hope it helps.
import numpy as np
import pandas as pd

data2000 = np.array([['','Country','totmigrants','Gender', 'Total'],
                     ['1','Afghanistan', 73, 'M', 70],
                     ['2','Afghanistan', None, 'F', 3],
                     ['3','Albania', 11, 'M', 5],
                     ['4','Albania', None ,'F', 6]])
data2001 = np.array([['','Country','totmigrants','Gender', 'Total'],
                     ['1','Afghanistan', 75, 'M', 60],
                     ['2','Afghanistan', None, 'F', 15],
                     ['3','Albania', 15, 'M', 11],
                     ['4','Albania', None ,'F', 4]])
# and so on
datas = {'2000':data2000, '2001':data2001}
reg_dfs = []
for year,data in datas.items():
    # First row holds the column names, first column holds the row labels
    df = pd.DataFrame(data=data[1:,1:],
                      index=data[1:,0],
                      columns=data[0,1:])
    # Self-merge pairs every row of a country with every other row of it
    new=pd.merge(df,df,how='inner',on=['Country'])
    # Keep only the pairing where the left row is M and the right row is F
    reg_df = new.query('Gender_x == "M" & Gender_y == "F"' )[['Country', 'Total_x', 'Total_y', 'totmigrants_x']].copy()
    reg_df.columns = ['Country', 'M', 'F', 'Total']
    reg_df['Year'] = year
    reg_dfs.append(reg_df)
# DataFrame.sort was removed from pandas; sort_values replaces it
print(pd.concat(reg_dfs).sort_values(['Country', 'Year']))
# Country M F Total Year
#1 Afghanistan 70 3 73 2000
#1 Afghanistan 60 15 75 2001
#5 Albania 5 6 11 2000
#5 Albania 11 4 15 2001
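The same result can also be reached without the self-merge. The sketch below is one possible alternative, not the approach used above: it stacks the yearly frames with `pd.concat` and reshapes once with `pivot_table`. The names `frames`, `stacked`, and `wide` are illustrative, and the sample values are the same made-up numbers as in this answer.

```python
import pandas as pd

# Hypothetical yearly tables, keyed by year (values are illustrative).
frames = {
    2000: pd.DataFrame({
        'Country': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
        'Gender': ['M', 'F', 'M', 'F'],
        'Total': [70, 3, 5, 6],
    }),
    2001: pd.DataFrame({
        'Country': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
        'Gender': ['M', 'F', 'M', 'F'],
        'Total': [60, 15, 11, 4],
    }),
}

# Stack all years; the dict keys become an index level named 'Year',
# which is then moved back into an ordinary column.
stacked = pd.concat(frames, names=['Year']).reset_index(level='Year')

# One row per (Country, Year), one column per gender.
wide = stacked.pivot_table(index=['Country', 'Year'], columns='Gender',
                           values='Total', aggfunc='sum', fill_value=0).reset_index()
wide.columns.name = None  # drop the leftover 'Gender' label on the columns
wide['Total'] = wide['M'] + wide['F']
print(wide)
```

Because the years are stacked before reshaping, adding another year only means adding another entry to `frames`.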