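I want to group by customer and match the items of rows with count == 1 against the items of rows with count > 1. If all of the items match, the probable merge ID should be added in a new column. For example, for Customer1 the items of id = 3 are contained in the items of id = 2, so it is a match and the probable merge ID is 2. Similarly, for Customer2, id = 7 has count 1 and its items are contained in the items of id = 5, so it is a match and the probable merge ID is 5.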
My DataFrame:
count  custmr     id  items
3      Customer1  1   Cabbage, beet, Okra, root
3      Customer1  2   Apple, Banana, Mango ,Pears, leafs
1      Customer1  3   Mango leafs
1      Customer1  4   tomato root
4      Customer2  5   grapes,leach,guava,pappaya
2      Customer2  6   blackberry,blueberry
1      Customer2  7   pappaya
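For reproducibility, the frame above can be built roughly like this. It is only a sketch: the item strings are copied as shown, including the inconsistent spacing around commas, and the variable name `df` is an assumption.

```python
import pandas as pd

# Sample data copied from the table above; items are kept as raw strings.
df = pd.DataFrame({
    "count": [3, 3, 1, 1, 4, 2, 1],
    "custmr": ["Customer1"] * 4 + ["Customer2"] * 3,
    "id": [1, 2, 3, 4, 5, 6, 7],
    "items": [
        "Cabbage, beet, Okra, root",
        "Apple, Banana, Mango ,Pears, leafs",
        "Mango leafs",
        "tomato root",
        "grapes,leach,guava,pappaya",
        "blackberry,blueberry",
        "pappaya",
    ],
})
print(df)
```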
Expected output:
count  custmr     id  items                               probable_merge_id
3      Customer1  1   Cabbage, beet, Okra, root
3      Customer1  2   Apple, Banana, Mango ,Pears, leafs
1      Customer1  3   Mango leafs                         2
1      Customer1  4   tomato root
4      Customer2  5   grapes,leach,guava,pappaya
2      Customer2  6   blackberry,blueberry
1      Customer2  7   pappaya                             5
Answer 0 (score: 2)
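First create a cross join within each customer using merge, filter for the rows where count == 1, and convert the item strings to sets so that they can be compared with < (proper subset). Finally, create a Series to use with map: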
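# Pair every row with every row of the same customer (cross join per customer).
df1 = df.merge(df, on='custmr')
# Keep only pairs whose left-hand row has count == 1.
df1 = df1[df1['count_x'] == 1].copy()
# Split on runs of whitespace and/or commas, then convert to sets.
df1['items_x'] = df1['items_x'].str.split(r'[\s,]+').apply(set)
df1['items_y'] = df1['items_y'].str.split(r'[\s,]+').apply(set)
# Keep pairs where items_x is a proper subset of items_y.
df1 = df1[df1['items_x'] < df1['items_y']]
print(df1)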
    count_x     custmr  id_x         items_x  count_y  id_y  \
9         1  Customer1     3  {Mango, leafs}        3     2
22        1  Customer2     7       {pappaya}        4     5

                                 items_y
9   {Mango, Pears, leafs, Apple, Banana}
22       {grapes, pappaya, leach, guava}
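The filter relies on how Python sets overload the comparison operators. The following plain-Python sketch, using values taken from the sample data, shows why a row never matches itself under `<` and why "tomato root" finds no match:

```python
# Python sets overload comparison operators for subset tests.
small = {"Mango", "leafs"}
big = {"Apple", "Banana", "Mango", "Pears", "leafs"}

print(small < big)     # proper subset: True
print(small <= small)  # subset, equality allowed: True
print(small < small)   # a set is never a proper subset of itself: False

# "tomato" does not appear in the larger set, so this is False.
print({"tomato", "root"} < {"Cabbage", "beet", "Okra", "root"})
```

Because `<` is strict, two rows with identical item sets would not match each other; `<=` would be needed if such rows should count as a match.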
# Build a lookup Series: index is the count == 1 id, value is the matched id.
s = df1.set_index('id_x')['id_y']
print(s)
id_x
3    2
7    5
Name: id_y, dtype: int64
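# Map each id to its matched id; rows without a match get NaN,
# which is why the column is displayed as float.
df['probable_merge_id'] = df['id'].map(s)
print(df)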
   count     custmr  id                           items  probable_merge_id
0      3  Customer1   1          Cabbage,beet,Okra,root                NaN
1      3  Customer1   2  Apple,Banana,Mango,Pears,leafs                NaN
2      1  Customer1   3                     Mango leafs                2.0
3      1  Customer1   4                     tomato root                NaN
4      4  Customer2   5      grapes,leach,guava,pappaya                NaN
5      2  Customer2   6            blackberry,blueberry                NaN
6      1  Customer2   7                         pappaya                5.0
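The steps above could also be wrapped in a single function. The sketch below is a variation, not the code from the answer: the function name `add_probable_merge_id` is hypothetical, it compares count == 1 rows only against count > 1 rows (as the question words it), it uses `<=` instead of `<`, it keeps the first candidate when several match, and it casts the result to the nullable `Int64` dtype so the IDs are not displayed as floats.

```python
import pandas as pd


def add_probable_merge_id(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a nullable-integer probable_merge_id column."""
    out = df.copy()
    # Tokenize once: any run of whitespace and/or commas is a separator,
    # and empty tokens are dropped.
    tokens = out["items"].str.split(r"[\s,]+").apply(
        lambda parts: frozenset(p for p in parts if p)
    )
    work = out[["custmr", "id", "count"]].assign(tokens=tokens)

    # Pair each count == 1 row with every count > 1 row of the same customer.
    pairs = work[work["count"] == 1].merge(
        work[work["count"] > 1], on="custmr", suffixes=("_x", "_y")
    )
    # Keep pairs where every left-hand item also appears on the right.
    mask = pd.Series(
        [x <= y for x, y in zip(pairs["tokens_x"], pairs["tokens_y"])],
        index=pairs.index,
        dtype=bool,
    )
    pairs = pairs[mask]

    # Assumption: if several rows match, the first one is kept.
    mapping = pairs.drop_duplicates("id_x").set_index("id_x")["id_y"]
    out["probable_merge_id"] = out["id"].map(mapping).astype("Int64")
    return out


# Usage with the sample data from the question.
sample = pd.DataFrame({
    "count": [3, 3, 1, 1, 4, 2, 1],
    "custmr": ["Customer1"] * 4 + ["Customer2"] * 3,
    "id": [1, 2, 3, 4, 5, 6, 7],
    "items": [
        "Cabbage, beet, Okra, root",
        "Apple, Banana, Mango ,Pears, leafs",
        "Mango leafs",
        "tomato root",
        "grapes,leach,guava,pappaya",
        "blackberry,blueberry",
        "pappaya",
    ],
})
result = add_probable_merge_id(sample)
print(result)
```

On the sample data this assigns 2 to id 3 and 5 to id 7, and leaves every other row as a missing value.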