Question

I have two different DataFrames in Python, as shown below:

import pandas as pd
df = pd.DataFrame({'AAA' : ["a1","a2","a3","a4","a5","a6","a7"], 
                   'BBB' : ["c1","c2","c2","c2","c3","c3","c1"]})
df2 = pd.DataFrame({'AAA' : ["a1","a2","a4","a6","a7","a8","a9"], 
                    'BBB' : ["c11","c21","c21","c12","c13","c13","c11"]})

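I want to compare the values of "AAA" and count how many of them are shared, grouped by "BBB". For example, the similarity between c1 and c11 is 1 (a1), and the similarity between c2 and c21 is 2 (a2, a4).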

Answer 1

The following code computes the similarities you want (it does not use the CCC column):

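# rename_axis/reset_index(name=...) keep the column names 'index' and 'BBB'
# regardless of how the pandas version names the value_counts() result
sims = pd.merge(df,df2,how='outer').\
       groupby(['AAA'])['BBB'].sum().value_counts().\
       rename_axis('index').reset_index(name='BBB')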
#   index  BBB
#0  c2c21    2
#1  c3c12    1
#2  c1c13    1
#3  c1c11    1
#4     c2    1
#5    c11    1
#6     c3    1
#7    c13    1

sims['index'] = sims['index'].str.split('c').str[1:]
sims[sims['index'].str.len() > 1]
#     index  BBB
#0  [2, 21]    2
#1  [3, 12]    1
#2  [1, 13]    1
#3  [1, 11]    1
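
Concatenating the "BBB" strings and splitting them on 'c' again only works while every label has that exact shape. As a minimal sketch of the same count without any string handling, an inner merge on "AAA" followed by a group size might look like this. The suffixes '_df' and '_df2' and the column name 'similarity' are chosen here for illustration, and the sketch assumes each "AAA" value occurs at most once per frame:

```python
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})

# Inner merge on 'AAA' keeps only the values present in both frames;
# the suffixes (illustrative names) keep the two 'BBB' columns apart.
pairs = df.merge(df2, on='AAA', suffixes=('_df', '_df2'))

# Count the shared 'AAA' values for each (BBB_df, BBB_df2) pair.
pair_counts = (pairs.groupby(['BBB_df', 'BBB_df2'])
                    .size()
                    .reset_index(name='similarity'))
print(pair_counts)
```

Values that appear in only one frame (a3, a5, a8, a9) drop out in the merge, so no length filter is needed afterwards.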

Answer 2

You can compute it like this:

# merge both dataframes on column 'AAA' since
# in the end only the rows are of interest
# for which AAA is equal in both frames
merged = df.merge(df2, on='AAA', suffixes=['_df', '_df2'])

# define a function that checks whether
# the BBB string of df2 starts with
# the BBB string of df
def check(o):
    return o['BBB_df2'].startswith(o['BBB_df'])

# apply it to the dataframe to flag the matching rows
matches = merged.apply(check, axis='columns')
# now aggregate only the rows that satisfy both
# criteria (same AAA and matching BBB prefix)
result = merged[matches].groupby(['BBB_df', 'BBB_df2']).agg({'AAA': ['nunique', set]})
result.columns = ['similarity', 'AAA_values']
result.reset_index()

The output is:

Out[111]: 
  BBB_df BBB_df2  similarity AAA_values
0     c1     c11           1       {a1}
1     c1     c13           1       {a7}
2     c2     c21           2   {a2, a4}

Input data:

import pandas as pd
df = pd.DataFrame({'AAA' : ["a1","a2","a3","a4","a5","a6","a7"], 
                   'BBB' : ["c1","c2","c2","c2","c3","c3","c1"]})
df2 = pd.DataFrame({'AAA' : ["a1","a2","a4","a6","a7","a8","a9"], 
                    'BBB' : ["c11","c21","c21","c12","c13","c13","c11"]})
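
Note that the startswith check drops the pair c3/c12 (shared value a6), because "c12" does not start with "c3". If every pair of groups is of interest, one possible sketch is to cross-tabulate the merged frame. This assumes the same merge on "AAA" as in the code earlier in this answer; the variable name `matrix` is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'AAA': ["a1", "a2", "a3", "a4", "a5", "a6", "a7"],
                   'BBB': ["c1", "c2", "c2", "c2", "c3", "c3", "c1"]})
df2 = pd.DataFrame({'AAA': ["a1", "a2", "a4", "a6", "a7", "a8", "a9"],
                    'BBB': ["c11", "c21", "c21", "c12", "c13", "c13", "c11"]})

# Same merge as in the answer: one row per 'AAA' value found in both frames.
merged = df.merge(df2, on='AAA', suffixes=('_df', '_df2'))

# Rows are the groups of df, columns the groups of df2; each cell holds
# the number of 'AAA' values the two groups have in common (0 if none).
matrix = pd.crosstab(merged['BBB_df'], merged['BBB_df2'])
print(matrix)
```

A single similarity can then be read with `matrix.loc['c2', 'c21']`.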
