Question

I have a dataset with many rows, each containing a fruit description, for example:

An apple hangs on an apple tree
Bananas are yellow and tasty 
The apple is tasty

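I need to find the unique words in these descriptions (which I have already done), and then count how many rows each of those unique words appears in. Example: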

Apple 2 (rows)
Bananas 1 (rows)
tree 1 (rows)
tasty 2 (rows)

I have already done something like this:

rows <- data_frame %>%
  filter(str_detect(variable, "apple"))
count_rows <- as.data.frame(nrow(rows))

The problem is that I have too many unique words, so I don't want to do this by hand for each one. Any ideas?

Answer 1

One option using dplyr and tidyr could be:

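library(dplyr)
library(tidyr)
library(tibble)  # provides rowid_to_column()

df %>%
 rowid_to_column() %>%                                          # remember which row each word came from
 mutate(sentences = strsplit(sentences, " ", fixed = TRUE)) %>%
 unnest(sentences) %>%                                          # one word per row
 mutate(sentences = tolower(sentences)) %>%
 filter(sentences %in% list_of_words) %>%
 group_by(sentences) %>%
 summarise_all(n_distinct)                                      # distinct row ids per word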

  sentences rowid
  <chr>     <int>
1 apple         2
2 bananas       1
3 tasty         2
4 tree          1

Sample data:

df <- data.frame(sentences = c("An apple hangs on an apple tree",
                               "Bananas are yellow and tasty",
                               "The apple is tasty"),
                 stringsAsFactors = FALSE)   

list_of_words <- tolower(c("Apple", "Bananas", "tree", "tasty"))
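
Since the question mentions having too many unique words to list by hand, the same idea can be sketched in base R without a predefined word list. This is only a minimal sketch under the assumption that words are separated by single spaces, as in the sample data; the names `tokens_per_row` and `row_counts` are made up for illustration.

```r
# Sample sentences copied from the example data
sentences <- c("An apple hangs on an apple tree",
               "Bananas are yellow and tasty",
               "The apple is tasty")

# Split each row into lowercase tokens and keep each word once per row,
# so a word repeated within one row is still counted as a single row
tokens_per_row <- lapply(strsplit(tolower(sentences), " ", fixed = TRUE), unique)

# Tabulating the flattened list gives the number of rows containing each word
row_counts <- table(unlist(tokens_per_row))

# Look up any words of interest by name
row_counts[c("apple", "bananas", "tree", "tasty")]
```

Because every word in the data ends up in `row_counts`, no separate list of unique words is needed; punctuation or repeated spaces would require a more careful split.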

Answer 2

In base R, you can do it as follows.

r <- apply(sapply(words, function(s) grepl(s, df[[1]], ignore.case = TRUE)), 2, sum)
as.data.frame(r)
#        r
#Apple   2
#Bananas 1
#tree    1
#tasty   2

Data.

x <-
"'An apple hangs on an apple tree'
'Bananas are yellow and tasty' 
'The apple is tasty'"

x <- scan(textConnection(x), what = character())
df <- data.frame(x)

words <- c("Apple", "Bananas", "tree", "tasty")
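
To make the one-liner in this answer easier to follow, the intermediate result can be looked at on its own. The sketch below assumes the same three sentences stored in a plain character vector (here called `sentences`, a name chosen for illustration) and uses `colSums()` in place of `apply(..., 2, sum)`, which should give the same counts.

```r
sentences <- c("An apple hangs on an apple tree",
               "Bananas are yellow and tasty",
               "The apple is tasty")
words <- c("Apple", "Bananas", "tree", "tasty")

# One column per word, one row per sentence; TRUE where the word is found
hits <- sapply(words, function(w) grepl(w, sentences, ignore.case = TRUE))

# Summing each column counts the rows in which that word occurs
r <- colSums(hits)
r
```

Note that `grepl()` treats each word as a regular expression here, so it also matches inside longer words.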

Answer 3

A base R solution is to use grepl together with sapply or lapply:

sapply(list_of_words, function(x) sum(grepl(x, tolower(df$sentences), fixed = TRUE)))
apple bananas    tree   tasty 
    2       1       1       2
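
One caveat with substring matching is that a word like "apple" would also be counted in a row containing "pineapple". If whole-word matches are wanted, a possible variation is to wrap each word in `\\b` word boundaries. This is a sketch only: the helper `count_rows` and the "Pineapple" sentence are hypothetical additions for illustration, and it assumes the words contain no regular-expression metacharacters.

```r
# Hypothetical data: the second row contains "Pineapple", not "apple"
sentences <- c("An apple hangs on an apple tree",
               "Pineapple is tasty",
               "The apple is tasty")

# Hypothetical helper: number of rows in which `word` occurs as a whole word
count_rows <- function(word, text) {
  # \\b anchors the match at word boundaries, so "apple" skips "Pineapple"
  pattern <- paste0("\\b", word, "\\b")
  sum(grepl(pattern, text, ignore.case = TRUE, perl = TRUE))
}

whole_word <- sapply(c("apple", "tasty"), count_rows, text = sentences)
whole_word
```

With plain substring matching, "apple" would be found in all three of these rows; with the boundaries it is found in two.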

Count the number of rows that contain a word
