Question

I have a data frame with two column variables, word1 and word2, as shown below:

head(bigrams_filtered2, 20)
# A tibble: 20 x 2
   word1       word2      
   <chr>       <chr>      
 1 practice    risk       
 2 risk        management 
 3 management  rational   
 4 rational    meansend   
 5 meansend    based      
 6 based       process    
 7 process     risks      
 8 risks       identified 
 9 identified  analysed   
10 analysed    solved     
11 solved      mitigated  
12 objective   involves   
13 involves    human      
14 human       perceptions
15 perceptions biases     
16 opportunity jack       
17 differences stakeholder
18 stakeholder perceptions
19 perceptions broader    
20 broader     risk

I am trying to add additional column variables to this data.frame so that my output looks like this:

##     word1     word2    n totalbigrams           tf
## 1     st     louis 1930      3426965 0.0005631805
## 2  happy  birthday 1802      3426965 0.0005258297
## 3      1         2 1701      3426965 0.0004963576
## 4    los   angeles 1385      3426965 0.0004041477
## 5 social     media 1256      3426965 0.0003665051
## 6    san francisco 1245      3426965 0.0003632952

I am following an example here: http://www.rpubs.com/pnice421/347328

Under the heading "Generating Bigrams", they provide the following code as a way to achieve this, but it returns an error for me:

totalbigrams <- bigrams_filtered2 %>%
    summarize(total=sum(n))

Error in summarise_impl(.data, dots) : 
Evaluation error: invalid 'type' (closure) of argument.

If anyone has any suggestions about where I might be going wrong, it would be greatly appreciated! Thank you.

Answer 1

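You get the error because there is no variable named n in your data frame. You need to generate it first. The specific error arises because n is defined as a function in the tidyverse (dplyr's n()), which counts the number of rows in the data (or a subset of it). With no column called n available, sum(n) is handed that function instead of a numeric vector, hence "invalid 'type' (closure) of argument".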

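I don't know what n is supposed to contain in your data, but you need to work that out and create it before using that particular function.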
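As a rough sketch of what "generating n first" could look like, the snippet below builds a small made-up data frame (toy_bigrams is a hypothetical stand-in for bigrams_filtered2, not the asker's real data), creates n with dplyr's count(), and only then sums it. It assumes n is meant to be the number of times each word1/word2 pair occurs, which matches the desired output in the question but is not confirmed by it.

```r
library(dplyr)

# Hypothetical toy data standing in for bigrams_filtered2 (an assumption)
toy_bigrams <- tibble::tibble(
  word1 = c("risk", "risk", "human"),
  word2 = c("management", "management", "perceptions")
)

# There is no column n yet, so create it: count() adds n,
# the number of rows for each word1/word2 combination
bigram_counts <- toy_bigrams %>%
  count(word1, word2, sort = TRUE)

# Now sum(n) refers to the new column rather than the function dplyr::n()
totalbigrams <- bigram_counts %>%
  summarize(total = sum(n))
```

Here totalbigrams is a one-row tibble, so totalbigrams$total would be needed to use the value as a plain number.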

Answer 2

First, let's make an example dataset with the same structure as the one you are working with.

library(tidyverse)
library(tidytext)
library(janeaustenr)


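# Tokenize the novel into bigrams, then split each bigram into two word columns
# (tibble() is used here in place of the deprecated data_frame())
bigram_df <- tibble(txt = prideprejudice) %>%
    unnest_tokens(bigram, txt, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ")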

bigram_df

#> # A tibble: 122,203 x 2
#>    word1     word2    
#>    <chr>     <chr>    
#>  1 pride     and      
#>  2 and       prejudice
#>  3 prejudice by       
#>  4 by        jane     
#>  5 jane      austen   
#>  6 austen    chapter  
#>  7 chapter   1        
#>  8 1         it       
#>  9 it        is       
#> 10 is        a        
#> # ... with 122,193 more rows

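Now we can use dplyr's count() to find the number of times each bigram is used, together with the total number of bigrams and the term frequency tf. The key here is to use tidyr's unite() and separate() to glue the two words together and then split them apart again.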

bigram_df %>%
    unite(bigram, word1, word2, sep = " ") %>%
    count(bigram, sort = TRUE) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>% 
    mutate(totalbigrams = sum(n),
           tf = n / totalbigrams)

#> # A tibble: 54,998 x 5
#>    word1 word2     n totalbigrams      tf
#>    <chr> <chr> <int>        <int>   <dbl>
#>  1 of    the     464       122203 0.00380
#>  2 to    be      443       122203 0.00363
#>  3 in    the     382       122203 0.00313
#>  4 i     am      302       122203 0.00247
#>  5 of    her     260       122203 0.00213
#>  6 to    the     252       122203 0.00206
#>  7 it    was     251       122203 0.00205
#>  8 mr    darcy   243       122203 0.00199
#>  9 of    his     234       122203 0.00191
#> 10 she   was     209       122203 0.00171
#> # ... with 54,988 more rows

Created on 2018-04-22 by the reprex package (v0.2.0).

It sounds like you did some filtering. While the words are split into two columns, you can of course do that with dplyr's filter().
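A minimal sketch of that filtering step, under a few assumptions: toy_bigrams and my_stop_words are made-up names and values for illustration (tidytext's stop_words dataset would be a fuller list), and the counting here passes both columns straight to count() rather than going through unite() and separate(), which is an alternative to the approach above and not the code the answer used.

```r
library(dplyr)

# Hypothetical toy bigrams; in practice this would be bigram_df from above
toy_bigrams <- tibble::tibble(
  word1 = c("of", "mr", "mr", "she", "lady"),
  word2 = c("the", "darcy", "darcy", "was", "catherine")
)

# Hypothetical stop word list (an assumption for this sketch)
my_stop_words <- c("of", "the", "she", "was")

toy_tf <- toy_bigrams %>%
  # Drop bigrams where either word is a stop word
  filter(!word1 %in% my_stop_words,
         !word2 %in% my_stop_words) %>%
  # Count each remaining word1/word2 pair; this creates the column n
  count(word1, word2, sort = TRUE) %>%
  # Total over the filtered bigrams, and each bigram's share of that total
  mutate(totalbigrams = sum(n),
         tf = n / totalbigrams)
```

Note that totalbigrams is computed after filtering, so tf is relative to the filtered set. If the frequency should be relative to all bigrams, the total would have to be computed before the filter() step.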

Count the number of rows in an R data.frame and store it as an additional variable
