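My DataFrame contains a list of usernames and their comments; see the format below.
What is the fastest, most efficient way to find repeated duplicate comments (spam) for each user?
DataFrame format: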
Author | Comment
casy Nice picture!
linda I like this
casy Nice picture!
tom I disagree
bob Follow me
bob Follow me
bob Follow me
bob Follow me
casy Nice picture!
casy Wow!
linda Interesting post
linda Check my profile
bob Dissapointing
casy Wow!
I would like the result in the following format, so the resulting table would be:
Author | Number of dup. comments (descending) | Comment
bob 4 Follow me
casy 3 Nice picture!
casy 2 Wow!
bob 1 Dissapointing
linda 1 I like this
linda 1 Check my profile
linda 1 Interesting post
tom 1 I disagree
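Answer 0 (score: 4)
First use groupby and then size to count each (Author, Comment) pair, sort the counts with sort_values, turn them into a column with reset_index, and, if necessary, use reindex to change the column order: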
df = (df.groupby(['Author', 'Comment'], sort=False).size()
        .sort_values(ascending=False)
        .reset_index(name='Number')
        .reindex(columns=['Author', 'Number', 'Comment']))
print(df)
Author Number Comment
0 bob 4 Follow me
1 casy 3 Nice picture!
2 casy 2 Wow!
3 bob 1 Dissapointing
4 linda 1 Check my profile
5 linda 1 Interesting post
6 tom 1 I disagree
7 linda 1 I like this
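For anyone who wants to reproduce this, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the same chain. The final filter `Number > 1` is my own assumption about what counts as spam; the question does not state a threshold.

```python
import pandas as pd

# Sample data copied from the question.
rows = [
    ("casy", "Nice picture!"),
    ("linda", "I like this"),
    ("casy", "Nice picture!"),
    ("tom", "I disagree"),
    ("bob", "Follow me"),
    ("bob", "Follow me"),
    ("bob", "Follow me"),
    ("bob", "Follow me"),
    ("casy", "Nice picture!"),
    ("casy", "Wow!"),
    ("linda", "Interesting post"),
    ("linda", "Check my profile"),
    ("bob", "Dissapointing"),
    ("casy", "Wow!"),
]
df = pd.DataFrame(rows, columns=["Author", "Comment"])

# Count each (Author, Comment) pair and sort by count, descending.
counts = (
    df.groupby(["Author", "Comment"], sort=False)
      .size()
      .sort_values(ascending=False)
      .reset_index(name="Number")
      .reindex(columns=["Author", "Number", "Comment"])
)

# Assumed spam rule: a comment posted more than once by the same author.
spam = counts[counts["Number"] > 1]
print(spam)
```

The order of rows that share the same count is not guaranteed unless a stable sort is requested, so only the rows with distinct counts should be relied on to appear in a fixed position.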
Answer 1 (score: 1)
value_counts
Use the method designed specifically for this purpose:
df.groupby('Author').Comment.value_counts().sort_values(
ascending=False).reset_index(name='Number')
Author Comment Number
0 bob Follow me 4
1 casy Nice picture! 3
2 casy Wow! 2
3 tom I disagree 1
4 linda Interesting post 1
5 linda I like this 1
6 linda Check my profile 1
7 bob Dissapointing 1
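As a possible variation, assuming pandas 1.1 or newer, `DataFrame.value_counts` can count rows over a subset of columns directly and returns the counts already sorted in descending order. This is a sketch of that alternative, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Author": ["casy", "linda", "casy", "tom", "bob", "bob", "bob",
               "bob", "casy", "casy", "linda", "linda", "bob", "casy"],
    "Comment": ["Nice picture!", "I like this", "Nice picture!",
                "I disagree", "Follow me", "Follow me", "Follow me",
                "Follow me", "Nice picture!", "Wow!", "Interesting post",
                "Check my profile", "Dissapointing", "Wow!"],
})

# Count unique (Author, Comment) rows; the result is sorted descending by default.
result = df.value_counts(["Author", "Comment"]).reset_index(name="Number")
print(result)
```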
pd.factorize and np.bincount
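import numpy as np
import pandas as pd

# f: integer code per row, u: the unique (Author, Comment) pairs
f, u = pd.factorize(list(zip(df.Author, df.Comment)))
a, c = zip(*u)
# np.bincount(f) counts how often each code occurs
pd.DataFrame(dict(
    Author=a, Comment=c, Number=np.bincount(f)
)).sort_values('Number', ascending=False)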
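To make the mechanics clearer, here is a small sketch on toy input (the labels are hypothetical, not from the question). `pd.factorize` maps each distinct value to an integer code in order of first appearance, and `np.bincount` then counts how many times each code occurs. Passing a Series rather than a plain list is a conservative choice on my part.

```python
import numpy as np
import pandas as pd

# Toy input (hypothetical), standing in for the (Author, Comment) pairs.
labels = pd.Series(["a", "b", "a", "c", "a"])

# codes[i] is the integer id of labels[i]; uniques holds the distinct values.
codes, uniques = pd.factorize(labels)

# counts[k] is the number of occurrences of uniques[k].
counts = np.bincount(codes)

tally = dict(zip(uniques, counts.tolist()))
print(tally)
```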
Counter
from collections import Counter
pd.Series(
Counter(zip(df.Author, df.Comment))
).rename_axis(['Author', 'Comment']).reset_index(name='Number')
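The Counter snippet above returns the pairs unsorted. If the descending order from the question is wanted, one option is `Counter.most_common()`, which yields pairs from most to least frequent. The sketch below assumes the sample data from the question; the column order is chosen to match the requested output.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Author": ["casy", "linda", "casy", "tom", "bob", "bob", "bob",
               "bob", "casy", "casy", "linda", "linda", "bob", "casy"],
    "Comment": ["Nice picture!", "I like this", "Nice picture!",
                "I disagree", "Follow me", "Follow me", "Follow me",
                "Follow me", "Nice picture!", "Wow!", "Interesting post",
                "Check my profile", "Dissapointing", "Wow!"],
})

# Tally each (Author, Comment) pair.
pairs = Counter(zip(df["Author"], df["Comment"]))

# most_common() lists the pairs from highest to lowest count.
result = pd.DataFrame(
    [(author, n, comment) for (author, comment), n in pairs.most_common()],
    columns=["Author", "Number", "Comment"],
)
print(result)
```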