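Is it possible to compare 4 DataFrames on 2 columns and get a result containing the rows that repeat, i.e. that appear in 2 or more of the DataFrames? The result should also include the number of occurrences. My DataFrames look like this: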
>>> df1
  Circle Division  Power
0   AAAA       AA     25
1   BBBB       BB      5
>>> df2
  Circle Division  Power
0   CCCC       CC     25
1   BBBB       BB     66
>>> df3
  Circle Division  Power
0   DDDD       DD     55
1   FFFF       FF     68
2   AAAA       AA     87
>>> df4
  Circle Division  Power
0   AAAA       AA     45
1   CCCC       CC     56
I tried merging them one after another, but then got stuck.
Expected result:
>>> result_df
  Circle Division Power1 Power2 Power3 Power4  Repeated
0   AAAA       AA     25      -     87     45         3
1   BBBB       BB      5     66      -      -         2
2   CCCC       CC      -     25      -     56         2
Answer 0 (score: 2)
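Use concat together with DataFrame.set_index and the keys parameter to join all the DataFrames side by side, then flatten the resulting MultiIndex in the columns.
Create a new column with DataFrame.count, which counts the non-NaN values in each row, and filter with boolean indexing: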
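import pandas as pd

dfs = [df1, df2, df3, df4]
# Index each frame by the two comparison columns
comp = [x.set_index(['Circle', 'Division']) for x in dfs]
# Align side by side; keys 1..4 record which frame each Power column came from
df = pd.concat(comp, axis=1, keys=range(1, len(dfs) + 1))
# Flatten the (key, column) MultiIndex into Power1 .. Power4
df.columns = [f'{b}{a}' for a, b in df.columns]
# Number of frames in which each (Circle, Division) pair has a non-NaN Power
df['Repeat'] = df.count(axis=1)
# Keep only the pairs present in two or more frames
df = df[df['Repeat'] > 1]
df = df.reset_index()
print(df)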
  Circle Division  Power1  Power2  Power3  Power4  Repeat
0   AAAA       AA    25.0     NaN    87.0    45.0       3
1   BBBB       BB     5.0    66.0     NaN     NaN       2
2   CCCC       CC     NaN    25.0     NaN    56.0       2
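
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same concat / count / filter idea, using the sample frames from the question. The helper name repeated_rows and its parameters keys and value are hypothetical, introduced only for illustration. The sketch assumes each (Circle, Division) pair occurs at most once per frame and that Power itself is never NaN in the inputs, since a NaN there would not be counted as an occurrence.

```python
import pandas as pd

# Sample frames copied from the question.
df1 = pd.DataFrame({"Circle": ["AAAA", "BBBB"],
                    "Division": ["AA", "BB"], "Power": [25, 5]})
df2 = pd.DataFrame({"Circle": ["CCCC", "BBBB"],
                    "Division": ["CC", "BB"], "Power": [25, 66]})
df3 = pd.DataFrame({"Circle": ["DDDD", "FFFF", "AAAA"],
                    "Division": ["DD", "FF", "AA"], "Power": [55, 68, 87]})
df4 = pd.DataFrame({"Circle": ["AAAA", "CCCC"],
                    "Division": ["AA", "CC"], "Power": [45, 56]})


def repeated_rows(dfs, keys=("Circle", "Division"), value="Power"):
    """Hypothetical helper: keep key pairs found in at least two frames."""
    # Index every frame by the key columns and keep only the value column.
    indexed = [d.set_index(list(keys))[value] for d in dfs]
    # Align the Series side by side; the concat keys become the column
    # names Power1 .. PowerN.
    names = [f"{value}{i}" for i in range(1, len(dfs) + 1)]
    wide = pd.concat(indexed, axis=1, keys=names)
    # Count non-NaN cells per row, i.e. how many frames contain the pair.
    wide["Repeat"] = wide.count(axis=1)
    # Filter, sort by the key pair for a stable order, restore key columns.
    return wide[wide["Repeat"] > 1].sort_index().reset_index()


result = repeated_rows([df1, df2, df3, df4])
print(result)
```

Missing entries stay as NaN here rather than the "-" shown in the question; if that display is wanted, a final result.fillna("-") should do it, at the cost of turning the Power columns into object dtype.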