Question

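My data frame contains a list of usernames and their comments; see the format below.

What is the quickest, most efficient way to find the repeated comments (spam) for each user?

Data frame format: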

Author  | Comment
casy    Nice picture! 
linda   I like this 
casy    Nice picture! 
tom     I disagree 
bob     Follow me 
bob     Follow me 
bob     Follow me 
bob     Follow me 
casy    Nice picture! 
casy    Wow! 
linda   Interesting post 
linda   Check my profile
bob     Dissapointing
casy    Wow!

I would like to get the result in the following format, so the resulting table would be:

Author  | Number of dup. comments (descending)  | Comment   
bob     4   Follow me 
casy    3   Nice picture!
casy    2   Wow! 
bob     1   Dissapointing 
linda   1   I like this 
linda   1   Check my profile
linda   1   Interesting post 
tom     1   I disagree

Answer 1

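First use groupby with size to count each (Author, Comment) pair, then sort_values to order the counts, reset_index to turn the counts into a column, and, if necessary, reindex to change the order of the columns: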

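# Count rows per (Author, Comment) pair; sort=False skips sorting the group keys.
df = (df.groupby(['Author', 'Comment'], sort=False).size()
       .sort_values(ascending=False)                     # most repeated first
       .reset_index(name='Number')                       # counts become the 'Number' column
       .reindex(columns=['Author','Number','Comment']))  # reorder the columns
print(df)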
  Author  Number           Comment
0    bob       4         Follow me
1   casy       3     Nice picture!
2   casy       2              Wow!
3    bob       1     Dissapointing
4  linda       1  Check my profile
5  linda       1  Interesting post
6    tom       1        I disagree
7  linda       1       I like this
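
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and applies the same chain. The final filter is an assumption on my part: it treats "spam" as any comment the same author posted more than once, which the question implies but does not state. The names `rows`, `counts` and `spam` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
rows = [
    ("casy", "Nice picture!"),
    ("linda", "I like this"),
    ("casy", "Nice picture!"),
    ("tom", "I disagree"),
    ("bob", "Follow me"),
    ("bob", "Follow me"),
    ("bob", "Follow me"),
    ("bob", "Follow me"),
    ("casy", "Nice picture!"),
    ("casy", "Wow!"),
    ("linda", "Interesting post"),
    ("linda", "Check my profile"),
    ("bob", "Dissapointing"),
    ("casy", "Wow!"),
]
df = pd.DataFrame(rows, columns=["Author", "Comment"])

# Same chain as in the answer: count each (Author, Comment) pair.
counts = (
    df.groupby(["Author", "Comment"], sort=False)
      .size()
      .sort_values(ascending=False)
      .reset_index(name="Number")
      .reindex(columns=["Author", "Number", "Comment"])
)

# Assumption: a comment counts as spam once it appears more than once.
spam = counts[counts["Number"] > 1].reset_index(drop=True)
print(spam)
```

On the sample data this leaves three rows: bob with "Follow me" (4), casy with "Nice picture!" (3) and casy with "Wow!" (2).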

Answer 2

Intuitive `value_counts`

...using a method designed specifically for this purpose

df.groupby('Author').Comment.value_counts().sort_values(
    ascending=False).reset_index(name='Number')

  Author           Comment  Number
0    bob         Follow me       4
1   casy     Nice picture!       3
2   casy              Wow!       2
3    tom        I disagree       1
4  linda  Interesting post       1
5  linda       I like this       1
6  linda  Check my profile       1
7    bob     Dissapointing       1
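
Rows that share the same count come out in a different order in the two outputs above, because sorting on `Number` alone leaves the order of ties unspecified. If a reproducible order matters, one option is to sort on extra columns as tie-breakers. This is a sketch rather than part of the original answer; the tie-break columns and the name `ordered` are my own choice.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Author": ["casy", "linda", "casy", "tom", "bob", "bob", "bob", "bob",
               "casy", "casy", "linda", "linda", "bob", "casy"],
    "Comment": ["Nice picture!", "I like this", "Nice picture!", "I disagree",
                "Follow me", "Follow me", "Follow me", "Follow me",
                "Nice picture!", "Wow!", "Interesting post",
                "Check my profile", "Dissapointing", "Wow!"],
})

ordered = (
    df.groupby("Author")["Comment"]
      .value_counts()
      .reset_index(name="Number")
      # Highest count first; ties broken alphabetically by Author, then Comment.
      .sort_values(["Number", "Author", "Comment"],
                   ascending=[False, True, True])
      .reset_index(drop=True)
)
print(ordered)
```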

`pd.factorize` and `np.bincount`

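import numpy as np
import pandas as pd

# f: one integer code per row; u: the unique (Author, Comment) tuples
f, u = pd.factorize(list(zip(df.Author, df.Comment)))
a, c = zip(*u)
pd.DataFrame(dict(
    Author=a, Comment=c, Number=np.bincount(f)
)).sort_values('Number', ascending=False)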
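
A variant of the same idea, offered only as a sketch: factorize each column on its own and fold the two integer codes into one key, which avoids building a Python list of tuples. This is not the answer's code. It assumes neither column contains missing values (`pd.factorize` would code those as -1, which `np.bincount` rejects), and all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Author": ["casy", "linda", "casy", "tom", "bob", "bob", "bob", "bob",
               "casy", "casy", "linda", "linda", "bob", "casy"],
    "Comment": ["Nice picture!", "I like this", "Nice picture!", "I disagree",
                "Follow me", "Follow me", "Follow me", "Follow me",
                "Nice picture!", "Wow!", "Interesting post",
                "Check my profile", "Dissapointing", "Wow!"],
})

# Factorize each column separately: integer codes plus the unique labels.
a_codes, a_uniques = pd.factorize(df["Author"])
c_codes, c_uniques = pd.factorize(df["Comment"])

# Fold both codes into one integer key per (Author, Comment) pair.
n_comments = len(c_uniques)
pair_codes = a_codes * n_comments + c_codes

# Count every possible key, then keep only the keys that actually occur.
pair_counts = np.bincount(pair_codes)
present = np.flatnonzero(pair_counts)

result = pd.DataFrame({
    "Author": a_uniques.to_numpy()[present // n_comments],
    "Comment": c_uniques.to_numpy()[present % n_comments],
    "Number": pair_counts[present],
}).sort_values("Number", ascending=False).reset_index(drop=True)
print(result)
```

The key space grows as (unique authors) x (unique comments), so on data with many distinct values the `bincount` array can become large; the tuple-based version above does not have that issue.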

`Counter`

from collections import Counter

pd.Series(
    Counter(zip(df.Author, df.Comment))
).rename_axis(['Author', 'Comment']).reset_index(name='Number')
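
Since the question asks for the fastest option, it is worth measuring on your own data rather than guessing. The following is a rough timing sketch, not a benchmark from the original answers: the wrapper function names, the 1000x replication of the sample data and the repeat count are all arbitrary choices of mine, and results will vary with data size and pandas version.

```python
import timeit
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Author": ["casy", "linda", "casy", "tom", "bob", "bob", "bob", "bob",
               "casy", "casy", "linda", "linda", "bob", "casy"],
    "Comment": ["Nice picture!", "I like this", "Nice picture!", "I disagree",
                "Follow me", "Follow me", "Follow me", "Follow me",
                "Nice picture!", "Wow!", "Interesting post",
                "Check my profile", "Dissapointing", "Wow!"],
})

# Replicate the rows so that the timings are not dominated by fixed overhead.
big = pd.concat([df] * 1000, ignore_index=True)


def with_groupby_size(frame):
    return (frame.groupby(["Author", "Comment"], sort=False)
                 .size()
                 .reset_index(name="Number"))


def with_value_counts(frame):
    return (frame.groupby("Author")["Comment"]
                 .value_counts()
                 .reset_index(name="Number"))


def with_counter(frame):
    return (pd.Series(Counter(zip(frame["Author"], frame["Comment"])))
              .rename_axis(["Author", "Comment"])
              .reset_index(name="Number"))


for fn in (with_groupby_size, with_value_counts, with_counter):
    seconds = timeit.timeit(lambda: fn(big), number=20)
    print(f"{fn.__name__}: {seconds:.4f}s for 20 runs")
```

The sorting step is left out of all three functions so that only the counting itself is compared.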
