I start with input data like this:
email country_code
12345kinglobito94@hotmail.com RU
12345arturdyikan6211@gmail.com RU
12345leonardosebastianld.20@gmail.com PE
12345k23156876vs@hotmail.com RU
12345jhuillcag@hotmail.com PE
12345ergasovaskazon72@gmail.com RU
12345myrzadaevajrat@gmail.com RU
12345filomena@hotmail.com BR
12345jppicotajose20@hotmail.com BR
... ...
When printed, the DataFrame looks like this:
email country_code
0 12345kinglobito94@hotmail.com RU
1 12345arturdyikan6211@gmail.com RU
2 12345leonardosebastianld.20@gmail.com PE
3 12345k23156876vs@hotmail.com RU
4 12345jhuillcag@hotmail.com PE
5 12345ergasovaskazon72@gmail.com RU
6 12345myrzadaevajrat@gmail.com RU
7 12345filomena@hotmail.com BR
8 12345jppicotajose20@hotmail.com BR
... ...
Grouping by country is straightforward:
country_code
AR 21
BR 340
PE 198
RU 402
US 39
Name: email, dtype: int64
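The call that produced the counts above is not shown in the question. As a rough sketch, a Series of that shape (named `email`, indexed by `country_code`) could come from a plain `groupby` plus `count`. The DataFrame below is rebuilt from the nine sample rows only, so its numbers differ from the full data set; the variable name `per_country` is an assumption for illustration.

```python
import pandas as pd

# Rebuild the nine sample rows shown in the question.
df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR"],
})

# Count the non-null emails per country; the result is a Series named "email".
per_country = df.groupby("country_code")["email"].count()
print(per_country)
```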
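But I want to count how many hotmail and gmail domains there are for each country.
Answer 0: (score: 2)
Use a regular expression to extract the domain, then use groupby().size(), i.e.
df['domains'] = df['email'].str.extract(r'@(.*?)\.', expand=False)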
email country_code domains
0 12345kinglobito94@hotmail.com RU hotmail
1 12345arturdyikan6211@gmail.com RU gmail
2 12345leonardosebastianld.20@gmail.com PE gmail
3 12345k23156876vs@hotmail.com RU hotmail
4 12345jhuillcag@hotmail.com PE hotmail
5 12345ergasovaskazon72@gmail.com RU gmail
6 12345myrzadaevajrat@gmail.com RU gmail
7 12345filomena@hotmail.com BR hotmail
8 12345jppicotajose20@hotmail.com BR hotmail
df.groupby(["country_code","domains"]).size()
country_code domains
BR hotmail 2
PE gmail 1
hotmail 1
RU gmail 3
hotmail 2
dtype: int64
If you don't want the extra column, you can also pass the extracted Series directly as a grouping key:
df.groupby(["country_code", df['email'].str.extract(r'@(.*?)\.', expand=False)]).size()
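The question asks only about hotmail and gmail, so it may help to filter on the extracted domain before grouping. The following is a minimal sketch on the nine sample rows, not the answerer's own code; the names `domains`, `mask`, and `counts`, and the explicit `rename("domain")`, are assumptions added for illustration.

```python
import pandas as pd

# Rebuild the nine sample rows shown in the question.
df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR"],
})

# Extract the text between "@" and the next "." without adding a column.
domains = df["email"].str.extract(r"@(.*?)\.", expand=False).rename("domain")

# Keep only the two providers the question asks about.
mask = domains.isin(["hotmail", "gmail"])

# Group the filtered rows by country and by the aligned domain Series.
counts = df[mask].groupby(["country_code", domains[mask]]).size()
print(counts)
```

With the sample rows every address is already hotmail or gmail, so the filter changes nothing here; on the full data it would drop any other providers.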
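Answer 1: (score: 1)
We can also use str.replace(), but I think @Dark's variant is more idiomatic.
Note that pandas 2.0 and later treat the pattern as a literal string unless regex=True is passed:
In [17]: (df.assign(domain=df['email'].str.replace(r'.*?@(.*?)\.\w+', r'\1', regex=True))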
...: .groupby(['country_code', 'domain'])['email']
...: .count()
...: .reset_index(name='count'))
...:
Out[17]:
country_code domain count
0 BR hotmail 2
1 PE gmail 1
2 PE hotmail 1
3 RU gmail 3
4 RU hotmail 2
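If a wide table (one row per country, one column per domain) is easier to read than the long format above, `pd.crosstab` is one possible alternative. This is a sketch rather than something either answer proposed; it also swaps the regular expression for plain string splitting, and the names `domain` and `table` are assumptions for illustration.

```python
import pandas as pd

# Rebuild the nine sample rows shown in the question.
df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR"],
})

# Take the part after "@", then the part before the first "." of that.
domain = (
    df["email"].str.split("@").str[-1]
    .str.split(".").str[0]
    .rename("domain")
)

# Cross-tabulate countries against domains; missing combinations become 0.
table = pd.crosstab(df["country_code"], domain)
print(table)
```

Unlike `groupby().size()`, the cross-tabulation shows combinations that never occur (such as BR with gmail in the sample) as explicit zeros.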