I have two different DataFrames in Python, as shown below:
import pandas as pd
df = pd.DataFrame({'AAA' : ["a1","a2","a3","a4","a5","a6","a7"],
'BBB' : ["c1","c2","c2","c2","c3","c3","c1"]})
df2 = pd.DataFrame({'AAA' : ["a1","a2","a4","a6","a7","a8","a9"],
'BBB' : ["c11","c21","c21","c12","c13","c13","c11"]})
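I want to compare the values of "AAA" and, for each pair of "BBB" groups, find the number of values they share. For example, the similarity between c1 and c11 is 1 (a1), and the similarity between c2 and c21 is 2 (a2, a4).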
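To make the expected result concrete, here is a minimal sketch that reads "similarity" as the size of the intersection of the "AAA" values in two "BBB" groups. That reading is an assumption based on the two examples above, and `similarity` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})

# Collect the set of AAA values that belongs to each BBB group.
groups_df = df.groupby('BBB')['AAA'].apply(set)
groups_df2 = df2.groupby('BBB')['AAA'].apply(set)

def similarity(group_a, group_b):
    """Hypothetical helper: number of AAA values shared by two groups."""
    return len(groups_df[group_a] & groups_df2[group_b])

print(similarity('c1', 'c11'))  # shared value: a1
print(similarity('c2', 'c21'))  # shared values: a2, a4
```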
Answer 0 (score: 0)
The following code computes the similarities you want (it does not use the CCC column):
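# value_counts() names its output differently across pandas versions,
# so the columns are named explicitly as 'index' and 'BBB'.
sims = (pd.merge(df, df2, how='outer')
          .groupby(['AAA'])['BBB'].sum()
          .value_counts()
          .rename_axis('index')
          .reset_index(name='BBB'))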
# index BBB
#0 c2c21 2
#1 c3c12 1
#2 c1c13 1
#3 c1c11 1
#4 c2 1
#5 c11 1
#6 c3 1
#7 c13 1
sims['index'] = sims['index'].str.split('c').str[1:]
sims[sims['index'].str.len() > 1]
# index BBB
#0 [2, 21] 2
#1 [3, 12] 1
#2 [1, 13] 1
#3 [1, 11] 1
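Concatenating the "BBB" strings and splitting on 'c' only works while every label has that exact shape. As a possible alternative (a sketch, not the answerer's method), the same pair counts can be obtained by merging on "AAA" and counting rows per pair. The suffixes `_df` and `_df2` and the column name `similarity` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})

# The default inner merge keeps only AAA values present in both frames;
# the two BBB columns are told apart by the suffixes.
pairs = df.merge(df2, on='AAA', suffixes=('_df', '_df2'))

# Count how many shared AAA values each (BBB_df, BBB_df2) pair has.
pair_counts = (pairs.groupby(['BBB_df', 'BBB_df2'])
                    .size()
                    .reset_index(name='similarity'))
print(pair_counts)
```

This keeps the group labels as ordinary columns, so no string parsing is needed afterwards.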
Answer 1 (score: 0)
You can compute it like this:
# merge both dataframes on column 'AAA' since
# in the end only the rows are of interest
# for which AAA is equal in both frames
merged= df.merge(df2, on='AAA', suffixes=['_df', '_df2'])
# define a function that checks whether
# the BBB string of df2 starts with
# the BBB string of df
def check(o):
    return o['BBB_df2'].startswith(o['BBB_df'])
# apply it row by row to get a boolean mask
matches = merged.apply(check, axis='columns')
# now aggregate only the rows that meet both
# criteria (same AAA and matching BBB prefix)
result = merged[matches].groupby(['BBB_df', 'BBB_df2']).agg({'AAA': ['nunique', set]})
result.columns = ['similarity', 'AAA_values']
result.reset_index()
The output is:
Out[111]:
BBB_df BBB_df2 similarity AAA_values
0 c1 c11 1 {a1}
1 c1 c13 1 {a7}
2 c2 c21 2 {a2, a4}
Input data:
import pandas as pd
df = pd.DataFrame({'AAA' : ["a1","a2","a3","a4","a5","a6","a7"],
'BBB' : ["c1","c2","c2","c2","c3","c3","c1"]})
df2 = pd.DataFrame({'AAA' : ["a1","a2","a4","a6","a7","a8","a9"],
'BBB' : ["c11","c21","c21","c12","c13","c13","c11"]})
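On larger frames the row-wise `apply` in this answer may be slow. A minimal sketch of the same prefix filter, built with a plain comprehension over the two columns instead, could look like this. It assumes, as the answer does, that related groups are identified by the df2 label starting with the df label; the name `mask` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})

merged = df.merge(df2, on='AAA', suffixes=('_df', '_df2'))

# Build the boolean mask pairwise, without calling apply on every row.
mask = pd.Series(
    [b2.startswith(b1) for b1, b2 in zip(merged['BBB_df'], merged['BBB_df2'])],
    index=merged.index,
)

# Count distinct shared AAA values per matching pair of groups.
result = (merged[mask]
          .groupby(['BBB_df', 'BBB_df2'])['AAA']
          .nunique()
          .reset_index(name='similarity'))
print(result)
```

The pair (c3, c12) drops out here because "c12" does not start with "c3", which matches the output shown in the answer.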