Question

I want to compute the frequency of certain terms within each window of 10 words in a vector of individual words.

An example:

mywords<-sample(c("POS","NNTD","DD","HG","KKL"),10000, replace = TRUE)
mywords<-data.frame(mywords)
names(mywords)<-c("TheTerms")

For every 10 terms, I want the frequency of each term. I think this can be done in dplyr with

mywords%>%group_by(TheTerms)%>%summarise(n=n())

But how do I do this per block of 10 words?

Answer 1

Here is an idea,

library(dplyr)

 mywords %>% 
  group_by(grp = rep(seq(n()/10), each = 10)) %>% 
  count(TheTerms)

which gives,

# A tibble: 4,500 x 3
# Groups:   grp [1,000]
     grp TheTerms     n
   <int>   <fctr> <int>
 1     1       DD     3
 2     1       HG     4
 3     1      POS     3
 4     2       DD     1
 5     2       HG     1
 6     2      KKL     3
 7     2     NNTD     4
 8     2      POS     1
 9     3       HG     1
10     3      KKL     3
# ... with 4,490 more rows
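
Note that `rep(seq(n()/10), each = 10)` lines up with the rows only when the vector length is an exact multiple of 10. A minimal sketch of one way around that, assuming integer division on the row position is acceptable (the length 103 and the seed are illustrative choices, not from the question):

```r
library(dplyr)

set.seed(1)  # assumed seed, only so the sketch is reproducible
mywords <- data.frame(
  TheTerms = sample(c("POS", "NNTD", "DD", "HG", "KKL"), 103, replace = TRUE)
)

# Integer division maps rows 1-10 to window 1, rows 11-20 to window 2, ...
# The last window holds whatever rows are left over (3 here).
counts <- mywords %>%
  mutate(grp = (row_number() - 1) %/% 10 + 1) %>%
  count(grp, TheTerms)

head(counts)
```

Here the final window is shorter than 10, so its counts are not directly comparable to the others unless that is accounted for.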

Answer 2

Another option is data.table:

library(data.table)
setDT(mywords)[, .N, .(TheTerms, grp = as.integer(gl(nrow(mywords), 10, nrow(mywords))))]


Answer 3

In base R, you can use table like this:

table(rep(seq_along(mywords$TheTerms), each=10, length.out=nrow(mywords)), mywords$TheTerms)

     DD HG KKL NNTD POS
  1   2  0   2    2   4
  2   3  2   4    0   1
  3   3  1   1    3   2
  4   4  3   1    1   1
  5   0  6   3    1   0
  6   1  2   1    3   3
  7   2  3   1    2   2
  8   4  2   1    1   2
  9   2  1   4    1   2
  10  3  1   2    2   2

For display purposes, I switched the sample size to 100.
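
If relative rather than absolute frequencies are wanted, the same table can be passed to `prop.table`. This is a sketch under assumptions: the seed and the variable names `terms`, `window`, `counts` and `props` are illustrative and not part of the answer above.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
terms <- sample(c("POS", "NNTD", "DD", "HG", "KKL"), 100, replace = TRUE)

# Window index: 1 for elements 1-10, 2 for elements 11-20, and so on.
window <- rep(seq_along(terms), each = 10, length.out = length(terms))

# Rows are windows, columns are terms.
counts <- table(window, terms)

# margin = 1 divides each row by its row total, giving per-window proportions.
props <- prop.table(counts, margin = 1)

props
```

Each row of `props` sums to 1, so a value such as 0.3 means the term filled 3 of the 10 positions in that window.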

How to group by position in a vector using dplyr
