My DataFrame:
import pandas as pd

df = pd.DataFrame({
    'company': ['Chipotle', 'Branchburg Chipotle', 'Chipotle NJ', 'Chipotle 8853',
                'The Home Depot', 'Home Depot', '28211 Home Depot',
                'Wendys', 'BJs', 'Buffalo wings'],
    'address': ['123 Main Street Branchburg NJ 08853',
                '123 Main Street Branchburg NJ 08853',
                '123 Main Street Branchburg NJ 08853',
                '123 Main Street Branchburg NJ 08853',
                '1220 N Wendover Rd Charlotte NC 28211',
                '1220 N Wendover Rd Charlotte NC 28211',
                '1220 N Wendover Rd Charlotte NC 28211',
                '2805 Whitson St Selma CA 93662',
                '2805 Whitson St Selma CA 93662',
                '2805 Whitson St Selma CA 93662'],
})
   company              address
0  Chipotle             123 Main Street Branchburg NJ 08853
1  Branchburg Chipotle  123 Main Street Branchburg NJ 08853
2  Chipotle NJ          123 Main Street Branchburg NJ 08853
3  Chipotle 8853        123 Main Street Branchburg NJ 08853
4  The Home Depot       1220 N Wendover Rd Charlotte NC 28211
5  Home Depot           1220 N Wendover Rd Charlotte NC 28211
6  28211 Home Depot     1220 N Wendover Rd Charlotte NC 28211
7  Wendys               2805 Whitson St Selma CA 93662
8  BJs                  2805 Whitson St Selma CA 93662
9  Buffalo wings        2805 Whitson St Selma CA 93662
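I need to group by 'address', find the words common to all 'company' values in each group, and write the number of those words to a new 'count' column. For the first address the common word is Chipotle, so the count is 1; for the second address the common words are Home Depot, so the count is 2; for the third address there is no common word, so the count is 0.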
Expected output:
   company         address                                count
0  Chipotle        123 Main Street Branchburg NJ 08853    1
1  The Home Depot  1220 N Wendover Rd Charlotte NC 28211  2
2  Wendys          2805 Whitson St Selma CA 93662         0
I could loop over the DataFrame and use set intersection, but that is too slow. Is there a pandas method that does this?
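For reference, the loop-and-intersect approach mentioned above might look roughly like the following in plain Python. This is only a sketch of the baseline idea, not code from the question: the rows list, names_by_address, and counts are illustrative names, and the data is copied from the DataFrame above.

```python
from collections import defaultdict

# Sample rows as (company, address) pairs, copied from the question's data
rows = [
    ('Chipotle', '123 Main Street Branchburg NJ 08853'),
    ('Branchburg Chipotle', '123 Main Street Branchburg NJ 08853'),
    ('Chipotle NJ', '123 Main Street Branchburg NJ 08853'),
    ('Chipotle 8853', '123 Main Street Branchburg NJ 08853'),
    ('The Home Depot', '1220 N Wendover Rd Charlotte NC 28211'),
    ('Home Depot', '1220 N Wendover Rd Charlotte NC 28211'),
    ('28211 Home Depot', '1220 N Wendover Rd Charlotte NC 28211'),
    ('Wendys', '2805 Whitson St Selma CA 93662'),
    ('BJs', '2805 Whitson St Selma CA 93662'),
    ('Buffalo wings', '2805 Whitson St Selma CA 93662'),
]

# Collect the company names seen at each address
names_by_address = defaultdict(list)
for company, address in rows:
    names_by_address[address].append(company)

# For each address, count the words shared by every company name
counts = {}
for address, names in names_by_address.items():
    common = set.intersection(*(set(name.split()) for name in names))
    counts[address] = len(common)

print(counts)
```

The comparison is case-sensitive and splits on whitespace only, which matches the sample data but may need adjusting for messier names.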
Answer 0 (score: 4)
from functools import reduce
import operator

def log(x):
    # Intersect the word sets of every company name in the group
    inters = reduce(operator.and_, [set(r) for r in x.str.split()])
    if inters:
        return (' '.join(inters), len(inters))
    # No shared word: fall back to the first name with a count of 0
    return (x.iloc[0], 0)

df.groupby('address').agg(log).company.apply(pd.Series).rename({0: 'company', 1: 'count'}, axis=1)
                                       company     count
address
1220 N Wendover Rd Charlotte NC 28211  Home Depot  2
123 Main Street Branchburg NJ 08853    Chipotle    1
2805 Whitson St Selma CA 93662         Wendys      0
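The core of log is the reduce(operator.and_, ...) call, which folds set intersection across the word sets of one group. A minimal standalone sketch of that step, using the three Home Depot names from the question as sample input (the variable names here are illustrative):

```python
from functools import reduce
import operator

# Company names for one address group, taken from the question's data
names = ['The Home Depot', 'Home Depot', '28211 Home Depot']

# One set of words per company name
word_sets = [set(name.split()) for name in names]

# operator.and_ on two sets is set intersection, so reduce computes
# word_sets[0] & word_sets[1] & word_sets[2]
common = reduce(operator.and_, word_sets)

# set.intersection with unpacked arguments gives the same result
assert common == set.intersection(*word_sets)

print(sorted(common))  # ['Depot', 'Home']
```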
If you are on pandas 0.20, replace the final .rename(...) call in the groupby chain with:
.rename(columns={0: 'company', 1: 'count'})
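One caveat with joining a set is that sets are unordered, so the shared words can come out in any order (for example "Depot Home"). A possible variant, offered as a sketch rather than as the answer's method, iterates over the groups directly instead of returning tuples from agg and keeps the shared words in the order they appear in the group's first name. The helper name common_words and the result frame are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['Chipotle', 'Branchburg Chipotle', 'Chipotle NJ', 'Chipotle 8853',
                'The Home Depot', 'Home Depot', '28211 Home Depot',
                'Wendys', 'BJs', 'Buffalo wings'],
    'address': ['123 Main Street Branchburg NJ 08853'] * 4
               + ['1220 N Wendover Rd Charlotte NC 28211'] * 3
               + ['2805 Whitson St Selma CA 93662'] * 3,
})


def common_words(names):
    """Return the words shared by every name, ordered as in the first name."""
    shared = set.intersection(*(set(name.split()) for name in names))
    # dict.fromkeys drops repeated words while keeping first-seen order
    first_name_words = dict.fromkeys(names.iloc[0].split())
    return [word for word in first_name_words if word in shared]


rows = []
# sort=False keeps the addresses in order of first appearance
for address, names in df.groupby('address', sort=False)['company']:
    words = common_words(names)
    rows.append({
        # Fall back to the first company name when nothing is shared
        'company': ' '.join(words) if words else names.iloc[0],
        'address': address,
        'count': len(words),
    })

result = pd.DataFrame(rows, columns=['company', 'address', 'count'])
print(result)
```

The Python-level loop runs once per address rather than once per row, so it should stay cheap as long as the number of distinct addresses is moderate.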