Question

I have a column in a data frame where each row is a string. I want to get the frequency of each word in this column.

I tried:

prov <- df$column_x %>%
    na.omit() %>%
    tolower() %>%
    gsub("[,;?']", " ",.)

sort(table(prov), decreasing = TRUE)

This way, I get the number of times each string is repeated.

How can I get the number of times each word is repeated?

Answer 1

It sounds like you want a document-term matrix...

library(tm)

corp <- Corpus(VectorSource(df$x)) # convert column of strings into a corpus
dtm <- DocumentTermMatrix(corp)    # create document term matrix

> as.matrix(dtm)
    Terms
Docs hello world morning bye
   1     1     1       0   0
   2     2     0       1   0
   3     0     1       0   2

If you want to join it to the original data frame, you can do that too:

cbind(df, data.frame(as.matrix(dtm)))

                    x hello world morning bye
1         hello world     1     1       0   0
2 hello morning hello     2     0       1   0
3       bye bye world     0     1       0   2

Sample data used:

df <- data.frame(
  x = c("hello world", 
        "hello morning hello", 
        "bye bye world"),
  stringsAsFactors = FALSE
)

> df
                    x
1         hello world
2 hello morning hello
3       bye bye world
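
The matrix above gives per-row counts, while the question asks for the frequency of each word across the whole column. One way to get there, sketched here as an assumption rather than as part of the original answer, is to sum the columns of the matrix. The `wordLengths` control is added because tm's default drops terms shorter than three characters; the name `word_freq` is just for illustration.

```r
library(tm)

# Same sample data as in the answer
df <- data.frame(
  x = c("hello world", "hello morning hello", "bye bye world"),
  stringsAsFactors = FALSE
)

corp <- Corpus(VectorSource(df$x))

# wordLengths = c(1, Inf) keeps short words; the default minimum length is 3
dtm <- DocumentTermMatrix(corp, control = list(wordLengths = c(1, Inf)))

# Column sums give the total occurrences of each term across all rows
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
word_freq
```

For a large column, converting the whole matrix with `as.matrix()` may use a lot of memory, so this is best treated as a sketch for small to medium data.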

Answer 2

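You can collapse the column into a single string, then split that string into words using the regular expression \\W (which matches any non-word character), and count the frequency of each word with the table function.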

library(dplyr)
x <- c("First part of some text,", "another part of text,",NA , "last part of text.")
x <- x %>% na.omit() %>% tolower() 
xx <- paste(x, collapse = " ")
xxx <- unlist(strsplit(xx, "\\W"))
table(xxx)
xxx
        another   first    last      of    part    some    text 
      2       1       1       1       3       3       1       3
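
The unnamed entry with a count of 2 in the output above is the empty string: splitting on a single `\\W` produces an empty token wherever two separators are adjacent, as in `"text, another"`. A minimal variation, assuming empty tokens are not wanted, is to split on runs of non-word characters with `\\W+` and drop anything that is still empty. The names `words` and `word_counts` are illustrative.

```r
x <- c("First part of some text,", "another part of text,", NA, "last part of text.")

# Drop missing values and normalise case
x <- tolower(x[!is.na(x)])

# "\\W+" treats a run such as ", " as a single separator
words <- unlist(strsplit(paste(x, collapse = " "), "\\W+"))

# Guard against an empty first token if the text starts with punctuation
words <- words[nzchar(words)]

word_counts <- table(words)
word_counts
```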

Answer 3

A pipe does the job.

df <- data.frame(column_x = c("hello world", "hello morning hello", 
                              "bye bye world"), stringsAsFactors = FALSE)
library(dplyr)
df$column_x %>%
  na.omit() %>%
  tolower() %>%
  strsplit(split = " ") %>% # or strsplit(split = "\\W") 
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)
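
If the counts are needed for further processing rather than just printed, the sorted table can be turned into a data frame. The following is a rough base R sketch of that step, not part of the original answer; the column names `word` and `n` are arbitrary choices made for this example.

```r
# Same sample data as above
column_x <- c("hello world", "hello morning hello", "bye bye world")

# Drop NAs, lower-case, and split on single spaces as in the pipeline above
words <- unlist(strsplit(tolower(column_x[!is.na(column_x)]), split = " "))
counts <- sort(table(words), decreasing = TRUE)

# as.data.frame() on a one-way table yields one row per distinct word
word_freq <- as.data.frame(counts, stringsAsFactors = FALSE)
names(word_freq) <- c("word", "n")  # illustrative column names
word_freq
```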
