Summarizing word-count frequencies in a dplyr pipeline

Time: 2018-04-27 08:41:43

Tags: r string dplyr

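I want to produce a frequency summary of word counts inside a dplyr pipeline. It has to stay inside the pipeline because I am actually querying the data through bigrquery, which behaves as a dplyr pipeline.

Suppose I have data like this: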

tf1 <- tbl_df(data.frame(row= c(1:5), body=c("tt t ttt j ss oe", "kpw eero", "pow eir sap r", "s", "oe")))

I would like to get a summary of the word counts, something like this:

  n_words freq
1       0    0
2       1    2
3       2    1
4       3    0
5       4    1
6       5    0
7       6    1

But I need to do this inside a dplyr pipeline (something like the following does not work):

### DOES NOT WORK
tf1 %>%
  wordcount(body, sep = " ", count.function = sum)

2 Answers:

Answer 0: (score: 5)

Here is another idea, using complete to get all the values:

library(tidyverse)

tf1 %>% 
   mutate(n_words = stringr::str_count(body, ' ') + 1) %>% 
   count(n_words) %>% 
   complete(n_words = 0:max(n_words))

which gives,

# A tibble: 7 x 2
  n_words     n
    <dbl> <int>
1      0.    NA
2      1.     2
3      2.     1
4      3.    NA
5      4.     1
6      5.    NA
7      6.     1
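
The desired output in the question shows 0 rather than NA for word counts that never occur. The following is a minimal sketch of one way to get that, assuming a reasonably recent dplyr (for the name argument of count) and tidyr (for the fill argument of complete). The column name freq comes from the question, and tf1 is rebuilt with tibble() only so the block is self-contained. It runs on a local tibble; nothing in this answer establishes whether the same calls translate for a bigrquery-backed table.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same example data as in the question, rebuilt locally (assumption: local tibble)
tf1 <- tibble(
  row  = 1:5,
  body = c("tt t ttt j ss oe", "kpw eero", "pow eir sap r", "s", "oe")
)

res <- tf1 %>%
  # words = number of single-space separators + 1
  mutate(n_words = str_count(body, " ") + 1L) %>%
  # tally each word count, naming the tally column freq as in the question
  count(n_words, name = "freq") %>%
  # add the missing word counts and fill their frequency with 0 instead of NA
  complete(n_words = 0:max(n_words), fill = list(freq = 0L))

res
```

The only changes from the pipeline above are the name argument, which renames n to freq, and the fill argument, which replaces the NA entries with 0.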

Answer 1: (score: 0)

library(dplyr)
library(stringr)
tf1 %>% mutate(wordcount = str_split(body, " ") %>% lengths()) %>% count(wordcount)
## # A tibble: 4 x 2
##   wordcount     n
##       <int> <int>
## 1         1     2
## 2         2     1
## 3         4     1
## 4         6     1

str_split(tf1$body, " ") returns

[[1]]
[1] "tt"  "t"   "ttt" "j"   "ss"  "oe" 

[[2]]
[1] "kpw"  "eero"

[[3]]
[1] "pow" "eir" "sap" "r"  

[[4]]
[1] "s"

[[5]]
[1] "oe"

lengths computes the length of each list element, so

str_split(tf1$body, " ") %>% lengths()
## [1] 6 2 4 1 1

mutate adds this as a column named wordcount.

count returns the number of times each value occurs in the column wordcount and stores it as column n.
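
Both answers assume that words are separated by exactly one space. If the real data may contain repeated spaces, leading or trailing spaces, or empty strings, counting runs of non-whitespace characters is one possible alternative. This is a small sketch with hypothetical inputs, not part of either answer:

```r
library(stringr)

# Hypothetical inputs: single space, double space, padded, and empty string
samples <- c("a b", "a  b", " a ", "")

# Count runs of non-whitespace characters instead of counting separators
n_words <- str_count(samples, "\\S+")
n_words
## [1] 2 2 1 0
```

The resulting vector can be used inside mutate in the same way as the expressions above, for example mutate(wordcount = str_count(body, "\\S+")).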