Question

I have a data frame with sentences in the first column, and I want to count the words in them:

Input:

Foo bar
bar example
lalala foo
example sentence foo

Output:

foo       3
bar       2
example   2
lalala    1
sentence  1

Is there a simple way to do this?

If not, how should I go about it? I see two ways:

Append all the sentences in one huge string
And then count the words somehow

(very inefficient) or

Split the column in multiple columns on spaces " " (I know there's a package for that, can't remember which one)
And then rbind all the columns into one

Answer 1

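This is essentially your second approach. We can split the column on spaces (" ") with strsplit, then use table to count the frequency of each word. The expected output also appears to be case-insensitive, so convert the column to lowercase before splitting.

Assuming your data frame is called df and the target column is V1: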

table(unlist(strsplit(tolower(df$V1), " ")))

 #bar  example      foo   lalala sentence 
 #  2        2        3        1        1

If you need the result as a data frame:

data.frame(table(unlist(strsplit(tolower(df$V1), " "))))

#      Var1 Freq
#1      bar    2
#2  example    2
#3      foo    3
#4   lalala    1
#5 sentence    1
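
The desired output in the question lists the most frequent word first, while table returns words in alphabetical order. Here is a minimal sketch of one way to sort the counts. It assumes the same example data as the question, and the names words and word_counts are only illustrative. Splitting on the regular expression "\\s+" instead of a single space is an optional extra, so that repeated spaces do not produce empty "words".

```r
# Assumed example data, mirroring the input in the question
df <- data.frame(V1 = c("Foo bar", "bar example", "lalala foo", "example sentence foo"),
                 stringsAsFactors = FALSE)

# Lowercase, trim surrounding whitespace, then split on runs of whitespace
words <- unlist(strsplit(tolower(trimws(df$V1)), "\\s+"))

# Count each word, then sort from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)
word_counts   # "foo" comes first with a count of 3

# Same result as a data frame, if that is more convenient
word_counts_df <- as.data.frame(word_counts, stringsAsFactors = FALSE)
```

Words with equal counts (here bar and example) may appear in either order unless you add a secondary sort key.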

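Edit

Following the OP's update in the comments: if each sentence also has a score column, we need to sum the scores for each word.

Adding a reproducible example: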

df <- data.frame(v1 = c("Foo bar", "bar example", "lalala foo", "example sentence foo"),
                 score = c(2, 3, 1, 4))
df

#                    v1 score
#1              Foo bar     2
#2          bar example     3
#3           lalala foo     1
#4 example sentence foo     4

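One way to solve this is with the packages splitstackshape and dplyr. We use cSplit to reshape the sentences into a long data frame with one word per row, then group by the lowercased word and summarise to get the frequency (n()) and the sum of score for each word.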

library(splitstackshape)
library(dplyr)

cSplit(df, "v1", sep = " ", direction = "long") %>%
  group_by(tolower(v1)) %>%
  summarise(Count = n(), ScoreSum = sum(score))

#  tolower(v1) Count ScoreSum
#        (chr) (int)    (dbl)
#1         foo     3        7
#2         bar     2        5
#3     example     2        7
#4      lalala     1        1
#5    sentence     1        4

Or using only the tidyverse:

library(tidyverse)

df %>%
  separate_rows(v1, sep = ' ') %>%
  group_by(v1 = tolower(v1)) %>%
  summarise(Count = n(), ScoreSum = sum(score))
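
If you would rather avoid extra packages, the same count and score sum can probably be obtained in base R as well. This is only a sketch under the assumptions of the reproducible example above (columns v1 and score); the intermediate names words, long and result are made up for illustration.

```r
# Same example data as above
df <- data.frame(v1 = c("Foo bar", "bar example", "lalala foo", "example sentence foo"),
                 score = c(2, 3, 1, 4),
                 stringsAsFactors = FALSE)

# One character vector of lowercase words per sentence
words <- strsplit(tolower(df$v1), " ", fixed = TRUE)

# Long format: one row per word, repeating each sentence's score for its words
long <- data.frame(word  = unlist(words),
                   score = rep(df$score, lengths(words)),
                   stringsAsFactors = FALSE)

# Per-word frequency and score sum (tapply returns groups in alphabetical order)
counts <- tapply(long$score, long$word, length)
sums   <- tapply(long$score, long$word, sum)

result <- data.frame(word     = names(counts),
                     Count    = as.vector(counts),
                     ScoreSum = as.vector(sums),
                     stringsAsFactors = FALSE)
result
```

The key step is rep(df$score, lengths(words)), which carries each sentence's score along to every word it contains before grouping.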

Answer 2

Try this:

str.isfloat()
