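In an exercise, I was asked to merge 3 DataFrames with an inner join (df1 + df2 + df3 = mergedDf). In a follow-up question, I was asked how many entries are lost when performing this 3-way merge.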
import pandas as pd

#DataFrame1
df1 = pd.DataFrame(columns=["Goals","Medals"],data=[[5,2],[1,0],[3,1]])
df1.index = ['Argentina','Angola','Bolivia']
print(df1)
           Goals  Medals
Argentina      5       2
Angola         1       0
Bolivia        3       1
#DataFrame2
df2 = pd.DataFrame(columns=["Dates","Medals"],data=[[1,0],[2,1],[2,2]])
df2.index = ['Venezuela','Africa','Argentina']
print(df2)
           Dates  Medals
Venezuela      1       0
Africa         2       1
Argentina      2       2
#DataFrame3
df3 = pd.DataFrame(columns=["Players","Goals"],data=[[11,5],[11,1],[10,0]])
df3.index = ['Argentina','Australia','Belgica']
print(df3)
           Players  Goals
Argentina       11      5
Australia       11      1
Belgica         10      0
#mergedDf
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
print(mergedDf)
           Goals_x  Medals_x  Dates  Medals_y  Players  Goals_y
Argentina        5         2      2         2       11        5
#Calculate number of lost entries by code
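Answer 0 (score: 2)
One solution uses outer joins with the indicator parameter, then counts the rows that do not have both in the two indicator columns a and b by summing the True values (each True is counted as 1):
mergedDf = pd.merge(df1,df2,how='outer',left_index=True, right_index=True, indicator='a')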
mergedDf = pd.merge(mergedDf,df3,how='outer',left_index=True, right_index=True, indicator='b')
print(mergedDf)
           Goals_x  Medals_x  Dates  Medals_y           a  Players  Goals_y  \
Africa         NaN       NaN    2.0       1.0  right_only      NaN      NaN
Angola         1.0       0.0    NaN       NaN   left_only      NaN      NaN
Argentina      5.0       2.0    2.0       2.0        both     11.0      5.0
Australia      NaN       NaN    NaN       NaN         NaN     11.0      1.0
Belgica        NaN       NaN    NaN       NaN         NaN     10.0      0.0
Bolivia        3.0       1.0    NaN       NaN   left_only      NaN      NaN
Venezuela      NaN       NaN    1.0       0.0  right_only      NaN      NaN

                    b
Africa      left_only
Angola      left_only
Argentina        both
Australia  right_only
Belgica    right_only
Bolivia     left_only
Venezuela   left_only
missing = ((mergedDf['a'] != 'both') & (mergedDf['b'] != 'both')).sum()
print (missing)
6
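One caveat may be worth checking on other data. A label survives the 3-way inner join only if it is both in column a and both in column b, so a label that matches in one merge but not the other (for example, present in df1 and df2 but absent from df3) is dropped by the inner join yet is not counted by the & form above. The sample data contains no such label, which is why 6 is still correct there. A minimal sketch, assuming a hypothetical extra label 'Chile' that exists only in df1 and df2, compares the two forms:

```python
import pandas as pd

# Sample frames from the question, plus a hypothetical 'Chile' row that
# exists in df1 and df2 but not in df3 (an assumption for illustration).
df1 = pd.DataFrame({"Goals": [5, 1, 3, 4], "Medals": [2, 0, 1, 1]},
                   index=["Argentina", "Angola", "Bolivia", "Chile"])
df2 = pd.DataFrame({"Dates": [1, 2, 2, 3], "Medals": [0, 1, 2, 1]},
                   index=["Venezuela", "Africa", "Argentina", "Chile"])
df3 = pd.DataFrame({"Players": [11, 11, 10], "Goals": [5, 1, 0]},
                   index=["Argentina", "Australia", "Belgica"])

outer = pd.merge(df1, df2, how="outer", left_index=True, right_index=True,
                 indicator="a")
outer = pd.merge(outer, df3, how="outer", left_index=True, right_index=True,
                 indicator="b")

# Lost label: at least one of the two indicators is not 'both'.
lost_or = ((outer["a"] != "both") | (outer["b"] != "both")).sum()
# The & form only counts labels that fail to match in both merges.
lost_and = ((outer["a"] != "both") & (outer["b"] != "both")).sum()
print(lost_or, lost_and)
```

On this modified data the | form counts 'Chile' as lost and the & form does not, so the two totals differ by one. Both forms count distinct index labels in the outer join, not rows per input DataFrame.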
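Another solution uses inner joins and then, for each DataFrame, sums the filtered index values that are not matched in mergedDf.index. Note that this works if the values in each index are unique: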
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
vals = mergedDf.index
print (vals)
Index(['Argentina'], dtype='object')
dfs = [df1, df2, df3]
missing = sum((~x.index.isin(vals)).sum() for x in dfs)
print (missing)
6
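To see which labels each DataFrame loses, not only the total, the same idea can be sketched with Index.intersection and Index.difference. The names common and lost are illustrative, and the sketch assumes unique index labels, as above:

```python
import pandas as pd

df1 = pd.DataFrame({"Goals": [5, 1, 3], "Medals": [2, 0, 1]},
                   index=["Argentina", "Angola", "Bolivia"])
df2 = pd.DataFrame({"Dates": [1, 2, 2], "Medals": [0, 1, 2]},
                   index=["Venezuela", "Africa", "Argentina"])
df3 = pd.DataFrame({"Players": [11, 11, 10], "Goals": [5, 1, 0]},
                   index=["Argentina", "Australia", "Belgica"])
dfs = [df1, df2, df3]

# Labels present in every index; with unique labels this matches the
# index of the 3-way inner join.
common = dfs[0].index
for x in dfs[1:]:
    common = common.intersection(x.index)

# Labels that each DataFrame loses, as sorted lists.
lost = [sorted(x.index.difference(common)) for x in dfs]
missing = sum(len(labels) for labels in lost)
print(lost)
print(missing)
```

Unlike the outer-join count, this total is per input row: a label missing from only one DataFrame would be counted once for every other DataFrame that contains it.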
Answer 1 (score: 1)
You can pass True to the indicator parameter of merge:
df1=pd.DataFrame({'A':[1,2,3],'B':[1,1,1]})
df2=pd.DataFrame({'A':[2,3],'B':[1,1]})
df1.merge(df2,on='A',how='inner')
Out[257]:
   A  B_x  B_y
0  2    1    1
1  3    1    1
df1.merge(df2,on='A',how='outer',indicator =True)
Out[258]:
   A  B_x  B_y     _merge
0  1    1  NaN  left_only
1  2    1  1.0       both
2  3    1  1.0       both
mergedf=df1.merge(df2,on='A',how='outer',indicator =True)
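Combined with value_counts, this tells you how many rows you lose with an inner join, because only the both rows are kept when how='inner':
mergedf['_merge'].value_counts()
Out[260]:
both          2
left_only     1
right_only    0
Name: _merge, dtype: int64
For 3 DataFrames, filter on the two merge indicator columns for the value both.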
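The three-DataFrame case can be sketched as follows, assuming the frames share the key column A used in the example. The third frame df3 and the indicator column names m12 and m123 are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [1, 1, 1]})
df2 = pd.DataFrame({"A": [2, 3], "B": [1, 1]})
df3 = pd.DataFrame({"A": [3, 4], "C": [1, 1]})  # hypothetical third frame

# Chain two outer merges, giving each its own indicator column.
step1 = df1.merge(df2, on="A", how="outer", indicator="m12")
step2 = step1.merge(df3, on="A", how="outer", indicator="m123")

# Rows that chained inner joins would keep are 'both' in both columns.
kept = (step2["m12"] == "both") & (step2["m123"] == "both")
print(kept.sum())     # rows kept by the inner joins
print((~kept).sum())  # rows lost
```

Here only A == 3 appears in all three frames, so one row is kept and the remaining three rows of the outer result are lost.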
Answer 2 (score: 1)
I found a simple but effective solution:
# Df1(), Df2() and Df3() are assumed to return the three DataFrames;
# the wrapper name lost_entries is illustrative.
def lost_entries():
    df1 = Df1()
    df2 = Df2()
    df3 = Df3()
    inner = pd.merge(pd.merge(df1,df2,on='<Common column>',how='inner'),df3,on='<Common column>',how='inner')
    outer = pd.merge(pd.merge(df1,df2,on='<Common column>',how='outer'),df3,on='<Common column>',how='outer')
    return len(outer) - len(inner)
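Applied to the index-based frames from the question, the same length difference can be sketched like this. The helper merge_all is a hypothetical name introduced here, not part of pandas:

```python
import pandas as pd

df1 = pd.DataFrame({"Goals": [5, 1, 3], "Medals": [2, 0, 1]},
                   index=["Argentina", "Angola", "Bolivia"])
df2 = pd.DataFrame({"Dates": [1, 2, 2], "Medals": [0, 1, 2]},
                   index=["Venezuela", "Africa", "Argentina"])
df3 = pd.DataFrame({"Players": [11, 11, 10], "Goals": [5, 1, 0]},
                   index=["Argentina", "Australia", "Belgica"])


def merge_all(frames, how):
    """Chain index-on-index merges over a list of DataFrames."""
    merged = frames[0]
    for frame in frames[1:]:
        merged = pd.merge(merged, frame, how=how,
                          left_index=True, right_index=True)
    return merged


inner = merge_all([df1, df2, df3], "inner")
outer = merge_all([df1, df2, df3], "outer")
lost = len(outer) - len(inner)
print(lost)
```

The outer result has 7 labels and the inner result has 1, so the difference is 6, matching the other answers. With duplicate index labels the merges multiply rows, so the difference would no longer count distinct lost labels.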