Pandas groupby and find count of common strings

Asked: 2018-07-12 16:30:52

Tags: python pandas pandas-groupby

My DataFrame:

import pandas as pd

df = pd.DataFrame({'company':['Chipotle','Branchburg Chipotle','Chipotle NJ','Chipotle 8853','The Home Depot','Home Depot','28211 Home Depot','Wendys','BJs','Buffalo wings'],
'address':['123 Main Street Branchburg NJ 08853'
,'123 Main Street Branchburg NJ 08853'
,'123 Main Street Branchburg NJ 08853'
,'123 Main Street Branchburg NJ 08853'
,'1220 N Wendover Rd Charlotte NC 28211'
,'1220 N Wendover Rd Charlotte NC 28211'
,'1220 N Wendover Rd Charlotte NC 28211'
,'2805 Whitson St Selma CA 93662'
,'2805 Whitson St Selma CA 93662'
,'2805 Whitson St Selma CA 93662']})

    company             address
0   Chipotle            123 Main Street Branchburg NJ 08853
1   Branchburg Chipotle 123 Main Street Branchburg NJ 08853
2   Chipotle NJ         123 Main Street Branchburg NJ 08853
3   Chipotle 8853       123 Main Street Branchburg NJ 08853
4   The Home Depot      1220 N Wendover Rd Charlotte NC 28211
5   Home Depot          1220 N Wendover Rd Charlotte NC 28211
6   28211 Home Depot    1220 N Wendover Rd Charlotte NC 28211
7   Wendy's             2805 Whitson St Selma CA 93662
8   BJ's                2805 Whitson St Selma CA 93662
9   Buffalo wings       2805 Whitson St Selma CA 93662

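I need to group by 'address', find the words that are common to every value in the 'company' column of each group, and write the number of common words to a new 'count' column. For the first address the common word is Chipotle, so the count is 1; for the second address the common words are Home Depot, so the count is 2; for the third address there is no common word, so the count is 0.

Expected output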

     company        address                               count
0   Chipotle        123 Main Street Branchburg NJ 08853     1
1   The Home Depot  1220 N Wendover Rd Charlotte NC 28211   2
2   Wendy's         2805 Whitson St Selma CA 93662          0

I could loop over the DataFrame and use set intersection, but that is too slow. Is there a pandas way to do this?

1 answer:

Answer 0: (score: 4)

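from functools import reduce
import operator

def log(x):
    # Words shared by every company name in the group (x is one address's names).
    inters = reduce(operator.and_, [set(r) for r in x.str.split()])
    if inters:
        # Sets are unordered, so the joined word order is not guaranteed.
        return (' '.join(inters), len(inters))
    return (x.iloc[0], 0)  # no common word: first name, count 0

# Expand each (company, count) tuple into two columns.
df.groupby('address').agg(log).company.apply(pd.Series).rename({0: 'company', 1: 'count'}, axis=1)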

                                        company     count
address     
1220 N Wendover Rd Charlotte NC 28211   Home Depot  2
123 Main Street Branchburg NJ 08853     Chipotle    1
2805 Whitson St Selma CA 93662          Wendys      0
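
To see what the reduce step does for a single group, here is a minimal standalone sketch without pandas. The variable names are only for illustration; the company names are taken from the question.

```python
from functools import reduce
import operator

# Company names recorded for one address (second group in the question).
names = ['The Home Depot', 'Home Depot', '28211 Home Depot']

# One set of words per name, then intersect them pairwise from left to right.
word_sets = [set(name.split()) for name in names]
common = reduce(operator.and_, word_sets)

print(sorted(common))  # ['Depot', 'Home']
print(len(common))     # 2

# A group with no shared word reduces to the empty set, which is falsy.
no_overlap = reduce(operator.and_, [set(n.split()) for n in ['Wendys', 'BJs', 'Buffalo wings']])
print(len(no_overlap))  # 0
```

Because a set has no defined order, joining it directly can give the words in a different order than they appear in the names; sorting (as above) or filtering the first name's words are two ways to make the result stable.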

If you are on pandas 0.20, replace the rename call with:

.rename(columns={0: 'company', 1: 'count'})
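
If the order of the common words matters, one possible variant is sketched below. It is not part of the answer above: the helper name common_words and the explicit loop over groups are assumptions made for illustration. It keeps the words in the order they appear in the first company name of each group, and returns the address as a regular column, as in the expected output.

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['Chipotle', 'Branchburg Chipotle', 'Chipotle NJ', 'Chipotle 8853',
                'The Home Depot', 'Home Depot', '28211 Home Depot',
                'Wendys', 'BJs', 'Buffalo wings'],
    'address': ['123 Main Street Branchburg NJ 08853'] * 4
               + ['1220 N Wendover Rd Charlotte NC 28211'] * 3
               + ['2805 Whitson St Selma CA 93662'] * 3,
})

def common_words(names):
    """Hypothetical helper: words present in every name, in first-name order."""
    word_lists = [name.split() for name in names]
    shared = set(word_lists[0]).intersection(*word_lists[1:])
    return [word for word in word_lists[0] if word in shared]

records = []
# sort=False keeps the groups in order of first appearance in df.
for address, names in df.groupby('address', sort=False)['company']:
    words = common_words(names)
    records.append({
        # Fall back to the first company name when nothing is shared.
        'company': ' '.join(words) if words else names.iloc[0],
        'address': address,
        'count': len(words),
    })

result = pd.DataFrame(records, columns=['company', 'address', 'count'])
print(result)
```

This still visits each group once in Python, so it should not be expected to be faster than the agg version; the difference is only the stable word order and the flat column layout.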