How do I compare or merge two DataFrames using Python pandas?

Date: 2018-08-09 13:43:36

Tags: python pandas

How can I compare/merge two DataFrames on the start and data columns, and get the missing gaps together with a count for each gap?

DataFrame 1

id start 
1  2009
1  2010
1  2011
1  2012
2  2010
2  2011
2  2012
2  2013
2  2014

DataFrame 2

id data
1   2010
1   2012
2   2010
2   2011
2   2012

Expected output:

id first last size
1  2009   2009 1
1  2011   2011 1
2  2013   2014 2

How can I achieve this?

2 answers:

Answer 0 (score: 1)

Use merge with indicator=True and do an outer join first:

df11 = df1.rename(columns={'start':'data'})
df = df2.merge(df11, how='outer', indicator=True, on=['id','data']).sort_values(['id','data'])
print (df)
   id  data      _merge
5   1  2009  right_only
0   1  2010        both
6   1  2011  right_only
1   1  2012        both
2   2  2010        both
3   2  2011        both
4   2  2012        both
7   2  2013  right_only
8   2  2014  right_only

Then use the old solution, changing only the condition:

# boolean mask: True for rows that are not right_only; kept in a variable for reuse
m = (df['_merge'] != 'right_only').rename('g')
# use the cumulative sum of the mask as the index, so each run of consecutive right_only rows shares one group label
df.index = m.cumsum()
print (df)
   id  data      _merge
g                      
0   1  2009  right_only
1   1  2010        both
1   1  2011  right_only
2   1  2012        both
3   2  2010        both
4   2  2011        both
5   2  2012        both
5   2  2013  right_only
5   2  2014  right_only

# keep only the right_only rows and aggregate first, last and size per group
df2 = (df[~m.values].groupby(['id', 'g'])['data']
                     .agg(['first','last','size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
print (df2)
   id  first  last  size
0   1   2009  2009     1
1   1   2011  2011     1
2   2   2013  2014     2
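
For reference, the two steps above can be put together in one self-contained sketch. The function name missing_runs and the local variable names are my own and not part of the answer; the sample frames are the ones from the question. Note that a "run" here means adjacent rows after sorting by id and year, not numerically consecutive years.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 2, 2, 2],
                    "start": [2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2014]})
df2 = pd.DataFrame({"id": [1, 1, 2, 2, 2],
                    "data": [2010, 2012, 2010, 2011, 2012]})


def missing_runs(left, right):
    """Hypothetical helper: runs of rows in `left` (id, start) absent from `right` (id, data)."""
    merged = (right.merge(left.rename(columns={"start": "data"}),
                          how="outer", on=["id", "data"], indicator=True)
                   .sort_values(["id", "data"])
                   .reset_index(drop=True))
    # True where the year exists in `right` (i.e. the row is not right_only)
    present = merged["_merge"] != "right_only"
    # The cumulative sum only increases on present rows, so every run of
    # consecutive missing rows ends up with the same label
    gaps = merged.loc[~present].assign(g=present.cumsum())
    return (gaps.groupby(["id", "g"])["data"]
                .agg(["first", "last", "size"])
                .reset_index(level="g", drop=True)
                .reset_index())


result = missing_runs(df1, df2)
print(result)
```

Grouping by id as well as by the run label keeps a gap at the end of one id from being merged with a gap at the start of the next id.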

Answer 1 (score: 0)

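I answered a similar question for you yesterday. I don't know where you are getting the first and last columns from, but here is one way to find the missing years based on the example above: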

from functools import reduce  # reduce is not a builtin in Python 3
import pandas as pd

df1_year = pd.DataFrame(df1.groupby('id')['start'].apply(list))
df2_year = pd.DataFrame(df2.groupby('id')['data'].apply(list))
dfs = [df1_year,df2_year]
df_final = reduce(lambda left,right: pd.merge(left,right,on='id'), dfs)
df_final.reset_index(inplace=True)

def noMatch(a, b):
    return [x for x in a if x not in b]

df3 = []
for i in range(0, len(df_final)):
    df3.append(noMatch(df_final['start'][i],df_final['data'][i]))

missing_year = pd.DataFrame(df3)
missing_year['missingYear'] = missing_year.values.tolist()
df_concat = pd.concat([df_final, missing_year], axis=1)
df_concat = df_concat[['id','missingYear']]
df4 = []
for i in range(0,len(df_concat)):
    df4.append(df_concat.applymap(lambda x: x[i] if isinstance(x, list) else x))
df_final1 = reduce(lambda left,right: pd.merge(left,right,on='id'), df4)
pd.concat([df_final1[['id','missingYear_x']], df_final1[['id','missingYear_y']].rename(columns={'missingYear_y':'missingYear_x'})]).rename(columns={'missingYear_x':'missingYear'}).sort_index()

    id  missingYear
0   1   2009
0   1   2011
1   2   2013
1   2   2014
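
If only the list of missing years is needed, a shorter route might be a left merge with indicator=True, keeping the left_only rows. This is a sketch of an alternative, not the code above; the variable names merged and missing are my own, and it assumes the frames from the question.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"id": [1, 1, 1, 1, 2, 2, 2, 2, 2],
                    "start": [2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2014]})
df2 = pd.DataFrame({"id": [1, 1, 2, 2, 2],
                    "data": [2010, 2012, 2010, 2011, 2012]})

# Left merge keeps every (id, start) row of df1; _merge records whether
# a matching (id, data) row was found in df2
merged = df1.merge(df2.rename(columns={"data": "start"}),
                   how="left", on=["id", "start"], indicator=True)

# Rows found only in df1 are the missing years
missing = (merged.loc[merged["_merge"] == "left_only", ["id", "start"]]
                 .rename(columns={"start": "missingYear"})
                 .reset_index(drop=True))
print(missing)
```

Unlike the loop-based version, this does not depend on every id having the same number of missing years.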

Per your comment, to add these to df2, just append the data