How to use groupby to count entries by substring

Asked: 2017-12-10 09:15:24

Tags: python pandas pandas-groupby

I start with input data like this:

email                               country_code
12345kinglobito94@hotmail.com           RU
12345arturdyikan6211@gmail.com          RU
12345leonardosebastianld.20@gmail.com   PE
12345k23156876vs@hotmail.com            RU
12345jhuillcag@hotmail.com              PE
12345ergasovaskazon72@gmail.com         RU
12345myrzadaevajrat@gmail.com           RU
12345filomena@hotmail.com               BR
12345jppicotajose20@hotmail.com         BR
...                                    ...

When printed, it looks like this:

                                      email country_code
0            12345kinglobito94@hotmail.com           RU
1           12345arturdyikan6211@gmail.com           RU
2    12345leonardosebastianld.20@gmail.com           PE
3             12345k23156876vs@hotmail.com           RU
4               12345jhuillcag@hotmail.com           PE
5          12345ergasovaskazon72@gmail.com           RU
6            12345myrzadaevajrat@gmail.com           RU
7                12345filomena@hotmail.com           BR
8          12345jppicotajose20@hotmail.com           BR
...                                                 ...

Grouping by country is easy:

country_code
AR     21
BR    340
PE    198
RU    402
US     39
Name: email, dtype: int64
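
The question does not show the call that produced this output. A minimal sketch of one way to get a result of that shape, assuming the DataFrame is named `df` and using only the nine sample rows listed above (so the counts differ from the full data set):

```python
import pandas as pd

# The nine sample rows from the question; the full data set is not available.
df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR"],
})

# Count the non-null emails in each country.
per_country = df.groupby("country_code")["email"].count()
print(per_country)
```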

But I want to count how many hotmail and gmail addresses there are for each country.

2 answers:

Answer 0 (score: 2)

Use a regular expression to extract the domain, then use groupby().size(), i.e.

df['domains'] = df['email'].str.extract(r'@(.*?)\.', expand=False)

                                email country_code  domains
0          12345kinglobito94@hotmail.com           RU  hotmail
1         12345arturdyikan6211@gmail.com           RU    gmail
2  12345leonardosebastianld.20@gmail.com           PE    gmail
3           12345k23156876vs@hotmail.com           RU  hotmail
4             12345jhuillcag@hotmail.com           PE  hotmail
5        12345ergasovaskazon72@gmail.com           RU    gmail
6          12345myrzadaevajrat@gmail.com           RU    gmail
7              12345filomena@hotmail.com           BR  hotmail
8        12345jppicotajose20@hotmail.com           BR  hotmail

df.groupby(["country_code","domains"]).size()

country_code  domains
BR            hotmail    2
PE            gmail      1
              hotmail    1
RU            gmail      3
              hotmail    2
dtype: int64

If you don't want the extra column, you can also pass the extracted Series directly as a grouping key:

df.groupby(["country_code", df['email'].str.extract(r'@(.*?)\.', expand=False)]).size()
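
Since the question asks only about hotmail and gmail, the same idea can be narrowed to those two domains and reshaped into one row per country. This is a rough sketch, not part of the original answer: the `yahoo` row is a hypothetical addition to show the filter doing something, and the names `domains`, `wanted` and `table` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
        "12345example@yahoo.com",  # hypothetical row, not in the question's data
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR", "US"],
})

# Text between "@" and the first following dot; renamed so the index level has a clear name.
domains = df["email"].str.extract(r"@(.*?)\.", expand=False).rename("domain")

# Keep only the two domains of interest.
wanted = domains.isin(["hotmail", "gmail"])

# Count rows per (country, domain), then pivot the domains into columns.
table = (
    df[wanted]
    .groupby(["country_code", domains[wanted]])
    .size()
    .unstack(fill_value=0)
)
print(table)
```

Countries with no hotmail or gmail address at all (here the hypothetical US row) drop out of the result; `fill_value=0` only fills in a domain that is missing for a country that still appears.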

Answer 1 (score: 1)

We can also use str.replace(), but I think @Dark's variant is more idiomatic:

In [17]: (df.assign(domain=df['email'].str.replace(r'.*?@(.*?)\.\w+', r'\1', regex=True))
    ...:    .groupby(['country_code', 'domain'])['email']
    ...:    .count()
    ...:    .reset_index(name='count'))
    ...:
Out[17]:
  country_code   domain  count
0           BR  hotmail      2
1           PE    gmail      1
2           PE  hotmail      1
3           RU    gmail      3
4           RU  hotmail      2
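
The two answers end with different aggregations, size() and count(). On the sample data they give the same numbers, but in general size() counts rows while count() counts non-null values of the selected column. A small sketch of the difference, using a hypothetical three-row frame with one missing email (not data from the question):

```python
import pandas as pd

# Hypothetical data: the second email is missing on purpose.
df = pd.DataFrame({
    "email": ["a@hotmail.com", None, "c@gmail.com"],
    "country_code": ["RU", "RU", "PE"],
})

grouped = df.groupby("country_code")["email"]

sizes = grouped.size()    # rows per group, missing values included
counts = grouped.count()  # non-null emails per group

print(sizes)
print(counts)
```

Here RU has two rows but only one non-null email, so the two results differ for that group.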