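I want to produce a frequency summary of word counts, that is, how many rows contain each number of words. It has to happen inside a dplyr pipeline, because I am actually querying the data through bigrquery, which behaves like a dplyr pipeline.
Suppose I have data like this: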
tf1 <- tibble::tibble(row = 1:5, body = c("tt t ttt j ss oe", "kpw eero", "pow eir sap r", "s", "oe"))
I would like to get a summary of the word counts, something like this:
n_words freq
1 0 0
2 1 2
3 2 1
4 3 0
5 4 1
6 5 0
7 6 1
But I need to do this inside a dplyr pipeline (something like the following does not work):
### DOES NOT WORK
tf1 %>%
wordcount(body,sep=" ", count.function=sum)
Answer 0 (score: 5)
Here is another idea that uses complete to get all the values,
library(tidyverse)
tf1 %>%
mutate(n_words = stringr::str_count(body, ' ') + 1) %>%
count(n_words) %>%
complete(n_words = 0:max(n_words))
which gives,
# A tibble: 7 x 2
  n_words     n
    <dbl> <int>
1      0.    NA
2      1.     2
3      2.     1
4      3.    NA
5      4.     1
6      5.    NA
7      6.     1
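The output above leaves NA where the question asked for 0. A minimal sketch of one way to close that gap, assuming the `fill` argument of `tidyr::complete()` and the `name` argument of `dplyr::count()` are available in the installed versions. It runs on the local sample data only; whether these steps translate for a remote bigrquery table is not verified here, and a `collect()` after `count()` may be needed (an assumption).

```r
library(dplyr)
library(tidyr)
library(stringr)

# Same sample data as in the question
tf1 <- tibble(
  row  = 1:5,
  body = c("tt t ttt j ss oe", "kpw eero", "pow eir sap r", "s", "oe")
)

freq <- tf1 %>%
  # words = number of single spaces + 1 (assumes single-space separators)
  mutate(n_words = str_count(body, " ") + 1L) %>%
  # name the count column "freq" to match the layout asked for in the question
  count(n_words, name = "freq") %>%
  # add the missing word counts and fill their frequency with 0 instead of NA
  complete(n_words = 0:max(n_words), fill = list(freq = 0L))

print(freq)
```

Filling inside `complete()` keeps everything in one pipeline step, so no separate `replace_na()` call is needed afterwards.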
Answer 1 (score: 0)
library(dplyr)
library(stringr)
tf1 %>% mutate(wordcount = str_split(body, " ") %>% lengths()) %>% count(wordcount)
## # A tibble: 4 x 2
## wordcount n
## <int> <int>
## 1 1 2
## 2 2 1
## 3 4 1
## 4 6 1
str_split(tf1$body, " ")
returns
[[1]]
[1] "tt" "t" "ttt" "j" "ss" "oe"
[[2]]
[1] "kpw" "eero"
[[3]]
[1] "pow" "eir" "sap" "r"
[[4]]
[1] "s"
[[5]]
[1] "oe"
lengths computes the length of each list element, so
str_split(tf1$body, " ") %>% lengths()
## [1] 6 2 4 1 1
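mutate stores these lengths in a new column wordcount, and count returns the number of times each value occurs in column wordcount, storing it as column n.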
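Both answers count words by splitting on, or counting, single spaces, so an empty string is reported as one word and repeated or leading spaces inflate the count. A small sketch of an alternative that counts runs of non-whitespace with `str_count(body, "\\S+")`; the `messy` data frame is hypothetical and only there to show the difference.

```r
library(dplyr)
library(stringr)

# Hypothetical sample with an empty string, a double space, and padding spaces
messy <- tibble(body = c("", "oe", "kpw  eero", " pow eir "))

counts <- messy %>%
  mutate(
    by_spaces = str_count(body, " ") + 1L,  # counts separators, then adds one
    by_tokens = str_count(body, "\\S+")     # counts runs of non-whitespace
  )

print(counts)
```

On clean, single-space data such as tf1 the two columns agree, so this only matters when the text may contain empty or irregularly spaced entries.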