Joining missing rows from another table's column in Python

Time: 2018-02-05 22:37:46

Tags: python pandas dataframe join merge

I have two tables in Python (as DataFrames). One looks like this:

Country     Year    totmigrants
Afghanistan 2000    
Afghanistan 2001    
Afghanistan 2002    
Afghanistan 2003    
Afghanistan 2004    
Afghanistan 2005    
Afghanistan 2006    
Afghanistan 2007    
Afghanistan 2008    
Algeria     2000    
Algeria     2001    
Algeria     2002
...
Zimbabwe    2008

The other exists once per year (9 separate DataFrames in total, covering 2000-2008):

Year=2000
---------------------------------------
Country    totmigrants  Gender  Total
Afghanistan 73           M     70
Afghanistan              F     3
Albania     11           M     5
Albania                  F     6
Algeria     52           M     44
...
Zimbabwe                 F     1

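I want to join them together, with the first table on the outer side of the join. I had this idea, but it only works for merging on columns: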

new=pd.merge(table1,table2,how='left',on=['Country', 'Year'])

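What I want to see is the yearly migrant total taken from each year's DataFrame, with F and M appearing as new columns in the first table: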

Country     Year    totmigrants F  M
Afghanistan 2000      73       3  70
Afghanistan 2001    table3
Afghanistan 2002    table4
Afghanistan 2003    ...
Afghanistan 2004    
Afghanistan 2005    
Afghanistan 2006    
Afghanistan 2007    
Afghanistan 2008    
Algeria     2000    52          8 44
Algeria     2001    table3      ...
Algeria     2002    table4      ...
...
Zimbabwe    2008     ...        ...

Is there a specific method for this kind of merge, or which function do I need to use?

2 Answers:

Answer 0: (score: 1)

Here is how to combine the data from the yearly DataFrames. Let's assume the yearly DataFrames are stored in a dictionary keyed by year:

df = {2000: ..., 2001: ..., ..., 2008: ...}
yearly = []

for N in df.keys():
    # pivot() takes keyword arguments only in pandas 2.0 and later
    tmp = df[N].pivot(index='Country', columns='Gender', values='Total').fillna(0).astype(int)
    tmp['Year'] = N # Store the year
    tmp['totmigrants'] = tmp['M'] + tmp['F']
    yearly.append(tmp)

df = pd.concat(yearly)
print(df)
#Gender       F   M  Year  totmigrants
#Country                              
#Afghanistan  3  70  2000           73
#Albania      6   5  2000           11
#Algeria      0  44  2000           44
#Zimbabwe     1   0  2000            1

Now you can merge df with the first DataFrame, using ['Country','Year'] as the keys.
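
The final merge step could look roughly like the sketch below. It is only an illustration under stated assumptions: the first table is called table1 (the name used in the question) and holds just Country and Year, the yearly frames are in long format with Country, Gender and Total columns, and the names yearly_frames, pieces, combined and result are made up for this example. The sample numbers are the 2000 and 2001 figures quoted elsewhere in this thread.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary of yearly DataFrames (long format).
yearly_frames = {
    2000: pd.DataFrame({
        "Country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
        "Gender": ["M", "F", "M", "F"],
        "Total": [70, 3, 5, 6],
    }),
    2001: pd.DataFrame({
        "Country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
        "Gender": ["M", "F", "M", "F"],
        "Total": [60, 15, 11, 4],
    }),
}

# Assumed shape of the first table: one row per (Country, Year).
# Algeria is included to show what happens when no yearly data matches.
table1 = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Albania", "Albania", "Algeria"],
    "Year": [2000, 2001, 2000, 2001, 2000],
})

pieces = []
for year, frame in yearly_frames.items():
    # One row per country, one column per gender
    wide = frame.pivot(index="Country", columns="Gender", values="Total")
    wide = wide.fillna(0).astype(int)
    wide["Year"] = year
    wide["totmigrants"] = wide["M"] + wide["F"]
    # Turn the Country index back into an ordinary column for merging
    pieces.append(wide.reset_index())

combined = pd.concat(pieces, ignore_index=True)

# Left merge keeps every row of table1; unmatched rows get NaN
result = table1.merge(combined, how="left", on=["Country", "Year"])
print(result)
```

A left merge keeps every (Country, Year) pair from table1, so rows without matching yearly data (Algeria here) come back with NaN in F, M and totmigrants instead of being dropped.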

Answer 1: (score: 1)

I'm not sure you need the first table. I did the following; I hope it helps.

import numpy as np
import pandas as pd

data2000 = np.array([['','Country','totmigrants','Gender', 'Total'],
['1','Afghanistan', 73, 'M', 70],
['2','Afghanistan', None, 'F', 3],
['3','Albania', 11, 'M', 5],
['4','Albania', None ,'F', 6]])

data2001 = np.array([['','Country','totmigrants','Gender', 'Total'],
['1','Afghanistan', 75, 'M', 60],
['2','Afghanistan', None, 'F', 15],
['3','Albania', 15, 'M', 11],
['4','Albania', None ,'F', 4]])

# and so on
datas = {'2000':data2000, '2001':data2001}
reg_dfs = []
for year,data in datas.items():
    df = pd.DataFrame(data=data[1:,1:],
              index=data[1:,0],
              columns=data[0,1:])

    new=pd.merge(df,df,how='inner',on=['Country'])
    reg_df = new.query('Gender_x == "M" & Gender_y == "F"' )[['Country', 'Total_x', 'Total_y', 'totmigrants_x']]
    reg_df.columns = ['Country', 'M', 'F', 'Total']
    reg_df['Year'] = year
    reg_dfs.append(reg_df)

# DataFrame.sort() was removed from pandas; sort_values() is its replacement
print(pd.concat(reg_dfs).sort_values(['Country']))

#       Country   M   F Total  Year
#1  Afghanistan  70   3    73  2000
#1  Afghanistan  60  15    75  2001
#5      Albania   5   6    11  2000
#5      Albania  11   4    15  2001
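
If this result still needs to be joined onto the first table from the question, one detail is worth checking: the dictionary keys above are the strings '2000' and '2001', so the Year column holds strings, while the first table appears to store years as numbers. pandas will generally refuse to merge an integer key against a string key, so the types should be aligned first. The following is a minimal sketch of that step; the frame names wide, table1 and merged, and the assumption that the first table stores Year as integers, are illustrative rather than taken from the answer.

```python
import pandas as pd

# The result printed above, rebuilt by hand; Year is a string column
# because the dictionary keys were '2000' and '2001'.
wide = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "M": [70, 60, 5, 11],
    "F": [3, 15, 6, 4],
    "Total": [73, 75, 11, 15],
    "Year": ["2000", "2001", "2000", "2001"],
})

# Assumed shape of the first table, with integer years
table1 = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "Year": [2000, 2001, 2000, 2001],
})

# Align the key dtype, then rename Total to match the first table's column
wide["Year"] = wide["Year"].astype(int)
wide = wide.rename(columns={"Total": "totmigrants"})

merged = table1.merge(wide, how="left", on=["Country", "Year"])
print(merged)
```

Using integer keys in the dictionary from the start (2000 instead of '2000') would avoid the conversion altogether.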