How can I compare/merge two DataFrames on the start and data columns and get the missing gaps together with their counts?
DataFrame 1:
id start
1 2009
1 2010
1 2011
1 2012
2 2010
2 2011
2 2012
2 2013
2 2014
DataFrame 2:
id data
1 2010
1 2012
2 2010
2 2011
2 2012
Expected output:
id first last size
1 2009 2009 1
1 2011 2011 1
2 2013 2014 2
How can I achieve this?
Answer 0 (score: 1)
Use merge with indicator=True, doing an outer join first:
df11 = df1.rename(columns={'start':'data'})
df = df2.merge(df11, how='outer', indicator=True, on=['id','data']).sort_values(['id','data'])
print (df)
id data _merge
5 1 2009 right_only
0 1 2010 both
6 1 2011 right_only
1 1 2012 both
2 2 2010 both
3 2 2011 both
4 2 2012 both
7 2 2013 right_only
8 2 2014 right_only
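The answer does not show how df1 and df2 are built, so here is a minimal self-contained sketch that recreates the question's sample data and runs the same outer merge. The names merged and missing are only illustrative. Rows tagged right_only are those found in df1 (the right-hand frame in this merge) but not in df2.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "start": [2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2014],
})
df2 = pd.DataFrame({
    "id":   [1, 1, 2, 2, 2],
    "data": [2010, 2012, 2010, 2011, 2012],
})

# Align the key names, then outer-join with df2 on the left-hand side.
merged = (
    df2.merge(df1.rename(columns={"start": "data"}),
              how="outer", on=["id", "data"], indicator=True)
       .sort_values(["id", "data"])
)

# 'right_only' marks (id, year) pairs that exist in df1 but not in df2.
missing = merged.loc[merged["_merge"] == "right_only", ["id", "data"]]
print(missing.to_string(index=False))
```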
Then use the old solution, changing only the condition:
#boolean mask of the rows that are not right_only, kept in a variable for reuse
m = (df['_merge'] != 'right_only').rename('g')
#use the cumulative sum of the mask as the index, so each run of adjacent right_only rows shares one group label
df.index = m.cumsum()
print (df)
id data _merge
g
0 1 2009 right_only
1 1 2010 both
1 1 2011 right_only
2 1 2012 both
3 2 2010 both
4 2 2011 both
5 2 2012 both
5 2 2013 right_only
5 2 2014 right_only
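To see why the cumulative sum works as a group label, here is a small sketch on the mask alone. The flags are copied from the sorted merge result above (True where _merge is both, False where it is right_only); the variable names are illustrative.

```python
import pandas as pd

# True = year present in both frames, False = year only in df1 (missing from df2).
present = pd.Series([False, True, False, True, True, True, True, False, False])

# Every True bumps the counter by one. A False row adds nothing, so it
# inherits the count reached by the last True before it, and adjacent
# False rows therefore end up with the same label.
labels = present.cumsum()

# Labels of the missing rows only.
missing_labels = labels[~present].tolist()
print(missing_labels)  # → [0, 1, 5, 5]
```

The last two missing rows share label 5, which is what lets the groupby in the next step treat 2013 and 2014 as a single gap.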
#keep only the right_only rows and aggregate first, last and size per group
df2 = (df[~m.values].groupby(['id', 'g'])['data']
                    .agg(['first','last','size'])
                    .reset_index(level=1, drop=True)
                    .reset_index())
print (df2)
id first last size
0 1 2009 2009 1
1 1 2011 2011 1
2 2 2013 2014 2
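Put together, the steps above can be wrapped in one function. This is only a sketch: missing_runs is a hypothetical helper name, and it stores the run label in a column g rather than in the index, which should give the same grouping.

```python
import pandas as pd


def missing_runs(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Summarise runs of df1['start'] values that are absent from df2['data'].

    Hypothetical helper; the steps mirror the merge + cumsum + groupby
    approach described in the answer.
    """
    merged = (
        df2.merge(df1.rename(columns={"start": "data"}),
                  how="outer", on=["id", "data"], indicator=True)
           .sort_values(["id", "data"])
           .reset_index(drop=True)
    )
    present = merged["_merge"] != "right_only"
    # Run label: stays constant across adjacent missing rows.
    merged["g"] = present.cumsum()
    return (
        merged[~present]
        .groupby(["id", "g"])["data"]
        .agg(["first", "last", "size"])
        .reset_index(level="g", drop=True)
        .reset_index()
    )


# Usage with the question's sample data.
df1 = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "start": [2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2014],
})
df2 = pd.DataFrame({
    "id":   [1, 1, 2, 2, 2],
    "data": [2010, 2012, 2010, 2011, 2012],
})
result = missing_runs(df1, df2)
print(result)
```

Note that a run here means rows that are adjacent in df1 after sorting, not necessarily consecutive calendar years: if df1 itself skips a year, the missing rows on either side of the skip still land in the same run.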
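Answer 1 (score: 0)
I answered a similar question for you yesterday. I'm not sure where you get the first and last columns from, but here is one way to find the missing years based on the example above: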
import pandas as pd
from functools import reduce

#collect the years of each id into one list per id
df1_year = pd.DataFrame(df1.groupby('id')['start'].apply(list))
df2_year = pd.DataFrame(df2.groupby('id')['data'].apply(list))
dfs = [df1_year,df2_year]
df_final = reduce(lambda left,right: pd.merge(left,right,on='id'), dfs)
df_final.reset_index(inplace=True)

def noMatch(a, b):
    #items of a that do not appear in b
    return [x for x in a if x not in b]

df3 = []
for i in range(0, len(df_final)):
    df3.append(noMatch(df_final['start'][i],df_final['data'][i]))
missing_year = pd.DataFrame(df3)
missing_year['missingYear'] = missing_year.values.tolist()
df_concat = pd.concat([df_final, missing_year], axis=1)
df_concat = df_concat[['id','missingYear']]
#split the lists into one frame per list position
#as written, this step relies on the example's shape (two ids, two missing years each)
df4 = []
for i in range(0,len(df_concat)):
    #Series.map per column replaces DataFrame.applymap, which newer pandas versions deprecate
    df4.append(df_concat.apply(lambda col: col.map(lambda x: x[i] if isinstance(x, list) else x)))
df_final1 = reduce(lambda left,right: pd.merge(left,right,on='id'), df4)
result = pd.concat([df_final1[['id','missingYear_x']], df_final1[['id','missingYear_y']].rename(columns={'missingYear_y':'missingYear_x'})]).rename(columns={'missingYear_x':'missingYear'}).sort_index()
print (result)
id missingYear
0 1 2009
0 1 2011
1 2 2013
1 2 2014
Updated to use df2 per your comment; just add your data.
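For comparison, the same list of missing years can be produced with a per-id set lookup, which does not depend on how many ids or missing years there are. This is a swapped-in technique, not the code from this answer, and the names seen, is_missing and missing_year are illustrative.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "start": [2009, 2010, 2011, 2012, 2010, 2011, 2012, 2013, 2014],
})
df2 = pd.DataFrame({
    "id":   [1, 1, 2, 2, 2],
    "data": [2010, 2012, 2010, 2011, 2012],
})

# Years already covered in df2, collected per id as sets.
seen = df2.groupby("id")["data"].apply(set)

# Flag the df1 rows whose year is not in the matching id's set.
# An id with no rows in df2 falls back to an empty set, so all of
# its years are treated as missing.
is_missing = [
    year not in seen.get(i, set())
    for i, year in zip(df1["id"], df1["start"])
]

missing_year = (
    df1.loc[is_missing]
       .rename(columns={"start": "missingYear"})
       .reset_index(drop=True)
)
print(missing_year)
```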