Question

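I have a df containing the script of a TV show, split into two columns: one for the speaker and one for the line they speak. I want to filter the rows by speaker, count the words across all of that speaker's lines, and store the result in a new df, like this: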

Speaker Words
John    10000
Bob     20000
Doe     30000

A sample from the df:

line                                                                    speaker
All right Jim. Your quarterlies look very good.                         Michael

So far I have come up with this:

df1 <- lines %>%
  filter(speaker == 'John')

wordcount(df1$line)

Is there a for-loop approach, or some other alternative, that would simplify this process? Thanks!

Answer 1

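It is not clear to me why you would want a for loop here. There are several approaches you could take. As an aside, you should always state which packages you are using in your example.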

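First, let's create a reproducible example. We will call the wordcount function through the ngram namespace (ngram::wordcount) rather than attaching the package.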

library(tidyverse)

# Two rows per speaker; Words starts as NA and is filled in below
df <- data.frame(Speaker = rep(c("John", "Bob", "Doe"), 2),
                 Words = NA)
df[df$Speaker == "John", ]$Words <- "All right Jim. Your quarterlies look very good"
df[df$Speaker == "Bob", ]$Words <- "You all look good, except for John"
df[df$Speaker == "Doe", ]$Words <- "John, your performance is terrible"

One option is to use tapply to return the aggregated sums and coerce the result to a data.frame on the fly.

data.frame(Speaker = sort(unique(df$Speaker)),
           total_words = as.numeric(tapply(df$Words, df$Speaker,
                                           ngram::wordcount)))
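
If you would rather not depend on ngram, the same per-speaker totals can be sketched in base R. Note that `count_words` below is a hypothetical helper written for this illustration, not a function from any package. It assumes words are separated by whitespace and that there are no NA lines, so its counts may differ from ngram::wordcount in edge cases.

```r
# Hypothetical helper (not from any package): counts whitespace-separated
# tokens across all strings of a character vector. Assumes no NA values.
count_words <- function(x) {
  tokens <- unlist(strsplit(trimws(x), "\\s+"))
  sum(nzchar(tokens))
}

# Same toy data as the reproducible example above
df <- data.frame(
  Speaker = rep(c("John", "Bob", "Doe"), 2),
  Words = rep(c("All right Jim. Your quarterlies look very good",
                "You all look good, except for John",
                "John, your performance is terrible"), 2),
  stringsAsFactors = FALSE
)

# split() groups the lines by speaker; vapply() counts words per group
totals <- vapply(split(df$Words, df$Speaker), count_words, integer(1))

# Taking the names from the result keeps speakers and counts aligned
word_totals <- data.frame(Speaker = names(totals),
                          total_words = unname(totals),
                          stringsAsFactors = FALSE)
word_totals
```

Reading the speaker names from `names(totals)` avoids relying on `sort(unique(...))` happening to match the order in which the groups are returned.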

With a pipe-based approach, we can follow your example and return the total word count for a single speaker:

df %>% 
  filter(Speaker == "John") %>%
  summarize(total_words = ngram::wordcount(Words)) %>%
  as.numeric()

Or, again with pipes, return the total word count for every speaker as a data.frame:

df %>%
  group_by(Speaker) %>%
  summarize(total_words = ngram::wordcount(Words)) %>%
  as.data.frame()
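
For comparison, the for-loop version the question asks about might look roughly like the sketch below. It is more verbose than the grouped approaches and is shown only to make the equivalence concrete. As an assumption for this illustration, it uses a hypothetical base R helper, `count_words`, in place of ngram::wordcount so that it runs without extra packages.

```r
# Hypothetical helper (not from any package): counts whitespace-separated
# tokens across all strings of a character vector. Assumes no NA values.
count_words <- function(x) {
  tokens <- unlist(strsplit(trimws(x), "\\s+"))
  sum(nzchar(tokens))
}

# Same toy data as the reproducible example above
df <- data.frame(
  Speaker = rep(c("John", "Bob", "Doe"), 2),
  Words = rep(c("All right Jim. Your quarterlies look very good",
                "You all look good, except for John",
                "John, your performance is terrible"), 2),
  stringsAsFactors = FALSE
)

# Pre-allocate the result: one row per distinct speaker
speakers <- unique(df$Speaker)
result <- data.frame(Speaker = speakers,
                     total_words = integer(length(speakers)),
                     stringsAsFactors = FALSE)

# Filter the lines for each speaker in turn and store the word count
for (i in seq_along(speakers)) {
  lines_i <- df$Words[df$Speaker == speakers[i]]
  result$total_words[i] <- count_words(lines_i)
}
result
```

Pre-allocating `result` avoids growing a data frame inside the loop, which gets slow on long scripts.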

