How do I find the number of duplicate comments per user in a pandas DataFrame?

Time: 2018-05-27 13:44:46

Tags: python pandas dataframe

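My DataFrame contains a list of usernames and their comments; see the format below.

What is the fastest, most efficient way to find repeated comments (spam) for each user?

DataFrame format: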

Author  | Comment
casy    Nice picture! 
linda   I like this 
casy    Nice picture! 
tom     I disagree 
bob     Follow me 
bob     Follow me 
bob     Follow me 
bob     Follow me 
casy    Nice picture! 
casy    Wow! 
linda   Interesting post 
linda   Check my profile
bob     Dissapointing
casy    Wow! 
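
To try the answers below, the sample can be rebuilt as a DataFrame. This is only a sketch: it assumes the two columns are named Author and Comment, as in the header above, and that the trailing spaces in the listing are not part of the data.

```python
import pandas as pd

# Rebuild the sample data from the table above as (author, comment) pairs.
# Assumption: trailing whitespace in the listing is not part of the comments.
rows = [
    ("casy", "Nice picture!"),
    ("linda", "I like this"),
    ("casy", "Nice picture!"),
    ("tom", "I disagree"),
    ("bob", "Follow me"),
    ("bob", "Follow me"),
    ("bob", "Follow me"),
    ("bob", "Follow me"),
    ("casy", "Nice picture!"),
    ("casy", "Wow!"),
    ("linda", "Interesting post"),
    ("linda", "Check my profile"),
    ("bob", "Dissapointing"),
    ("casy", "Wow!"),
]
df = pd.DataFrame(rows, columns=["Author", "Comment"])
print(df.shape)
```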

I would like the result in the following format, so the resulting table would be:

Author  | Number of dup. comments (descending)  | Comment   
bob     4   Follow me 
casy    3   Nice picture!
casy    2   Wow! 
bob     1   Dissapointing 
linda   1   I like this 
linda   1   Check my profile
linda   1   Interesting post 
tom     1   I disagree

2 answers:

Answer 0 (score: 4)

Use groupby with size, then sort_values, then reset_index to turn the counts into a column, and, if necessary, reindex to change the column order:

df = (df.groupby(['Author', 'Comment'], sort=False).size()
       .sort_values(ascending=False)
       .reset_index(name='Number')
       .reindex(columns=['Author','Number','Comment']))
print (df)
  Author  Number           Comment
0    bob       4         Follow me
1   casy       3     Nice picture!
2   casy       2              Wow!
3    bob       1     Dissapointing
4  linda       1  Check my profile
5  linda       1  Interesting post
6    tom       1        I disagree
7  linda       1       I like this
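
If only the repeated comments are of interest, the counts can be filtered before sorting. The following is a minimal sketch on a shortened, assumed sample (not the full table from the question); the threshold of more than one occurrence and the extra sort keys for tie-breaking are illustrative choices, not part of the answer above.

```python
import pandas as pd

# Small stand-in for the question's data (assumed sample, not the full table).
df = pd.DataFrame({
    "Author": ["casy", "linda", "casy", "bob", "bob", "bob", "casy"],
    "Comment": ["Nice picture!", "I like this", "Nice picture!",
                "Follow me", "Follow me", "Follow me", "Wow!"],
})

# Count each (Author, Comment) pair, as in the answer above.
counts = (df.groupby(["Author", "Comment"], sort=False).size()
            .reset_index(name="Number"))

# Keep only comments posted more than once, then sort by count (descending)
# and by Author/Comment so that ties come out in a reproducible order.
spam = (counts[counts["Number"] > 1]
        .sort_values(["Number", "Author", "Comment"],
                     ascending=[False, True, True])
        .reset_index(drop=True))
print(spam)
```

Sorting on the count alone leaves the order of tied rows unspecified, which is why the extra keys are added here.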

Answer 1 (score: 1)

The intuitive way: value_counts

...using a method designed specifically for this purpose

df.groupby('Author').Comment.value_counts().sort_values(
    ascending=False).reset_index(name='Number')

  Author           Comment  Number
0    bob         Follow me       4
1   casy     Nice picture!       3
2   casy              Wow!       2
3    tom        I disagree       1
4  linda  Interesting post       1
5  linda       I like this       1
6  linda  Check my profile       1
7    bob     Dissapointing       1

pd.factorize and np.bincount

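import numpy as np
import pandas as pd

# Encode each (Author, Comment) pair as an integer code; u holds the unique pairs
f, u = pd.factorize(list(zip(df.Author, df.Comment)))
a, c = zip(*u)
# np.bincount(f) counts how often each code (i.e. each pair) occurs
pd.DataFrame(dict(
    Author=a, Comment=c, Number=np.bincount(f)
)).sort_values('Number', ascending=False)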
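
To see why this combination produces counts, it may help to look at what the two functions return on a toy input. The labels below are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Toy input (hypothetical labels, not the question's data).
labels = np.array(["x", "y", "x", "z", "x"], dtype=object)

# factorize assigns an integer code to each distinct value,
# numbering the values in order of first appearance.
codes, uniques = pd.factorize(labels)
print(codes)
print(uniques)

# bincount tallies how many times each integer code occurs,
# so position i holds the count for uniques[i].
print(np.bincount(codes))
```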

Counter

from collections import Counter

pd.Series(
    Counter(zip(df.Author, df.Comment))
).rename_axis(['Author', 'Comment']).reset_index(name='Number')
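
The Counter version above does not sort by count. One possible way to get the descending order asked for in the question is Counter.most_common, sketched here on a shortened, assumed sample; the variable names pairs and result are illustrative.

```python
from collections import Counter

import pandas as pd

# Shortened, assumed sample (not the full table from the question).
df = pd.DataFrame({
    "Author": ["bob", "casy", "bob", "bob", "casy", "tom"],
    "Comment": ["Follow me", "Wow!", "Follow me",
                "Follow me", "Wow!", "I disagree"],
})

# most_common() yields ((author, comment), count) pairs, highest count first.
pairs = Counter(zip(df.Author, df.Comment)).most_common()
result = pd.DataFrame(
    [(author, comment, n) for (author, comment), n in pairs],
    columns=["Author", "Comment", "Number"],
)
print(result)
```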