Question

I have the following dataset with postId and replyId:

      postId      replyId
1   6074801669  759224201176
2   6074801669  465047320447
3   6074801669  690812551148
4   6074801669  465047290095
5   6560801670  465047500011
6   6560801670  869614571745
7   6560801670  869614571745
8   11446901671 100552911701
9   11446901671 759224201176
10  11446901671 100552911701
11  11446901671 759224201176
12  11446901671 465047690560
13  11446901671 759224201176

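I want the frequency of replyId within each unique postId. More specifically, I want to know how many distinct replyId values appear for a given postId. I'm not sure my description is specific enough, but this is what I would like to see: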

      postId      replyId       replyId.freq
1   6074801669  759224201176       4
2   6074801669  465047320447       4
3   6074801669  690812551148       4
4   6074801669  465047290095       4
5   6560801670  465047500011       2
6   6560801670  869614571745       2
7   6560801670  869614571745       2
8   11446901671 100552911701       3
9   11446901671 759224201176       3
10  11446901671 100552911701       3
11  11446901671 759224201176       3
12  11446901671 465047690560       3
13  11446901671 759224201176       3

For example, postId = 11446901671 appears 6 times in the data frame, but it has only 3 distinct replyId values.

Answer 1

We can group by 'postId' and use n_distinct to count the unique 'replyId' values in each group, storing the result in a new column:

library(dplyr)
df %>%
    group_by(postId) %>%
    mutate(replyId.freq = n_distinct(replyId))

Or with base R:

# For each postId, count the distinct replyId values and repeat that count on every row
df$replyId.freq <- with(df, ave(replyId, postId, 
          FUN = function(x) length(unique(x))))
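
To check the base R approach against the sample data from the question, here is a minimal, self-contained sketch. The data frame name df and the choice to store the IDs as numeric values are assumptions; the question does not show how the data was read in.

```r
# Rebuild the question's data. The IDs are stored as doubles (an assumption),
# since they are too large for R's 32-bit integer type.
df <- data.frame(
  postId  = rep(c(6074801669, 6560801670, 11446901671), times = c(4, 3, 6)),
  replyId = c(759224201176, 465047320447, 690812551148, 465047290095,
              465047500011, 869614571745, 869614571745,
              100552911701, 759224201176, 100552911701,
              759224201176, 465047690560, 759224201176)
)

# Count the distinct replyId values within each postId and
# repeat that count on every row of the group.
df$replyId.freq <- with(df, ave(replyId, postId,
                                FUN = function(x) length(unique(x))))

# One row per postId, for comparison with the desired output.
unique(df[, c("postId", "replyId.freq")])
```

With this data the counts come out as 4, 2 and 3 for the three postId values, matching the desired output. Because ave() returns a vector of the same type as its first argument, this works as written for numeric IDs; if replyId were stored as character or factor, the result would likely need converting to a number.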
