Question

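I have been trying to write a function, or use the apply family, to select the rows of a data frame that contain the words I am looking for and label them with a tag. A row can have more than one tag. Could someone help me? I have been stuck on this for a while.

If my question is unclear, or if it has already been answered elsewhere, please point me in the right direction. Much appreciated!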

require(stringr)
require(dplyr)
df <- data.frame(sentences, rnorm(length(sentences)))

old = df %>% filter(str_detect(sentences, 'old')) %>% mutate(w = factor("old"))
new = df %>% filter(str_detect(sentences, 'new')) %>% mutate(w = factor("new"))
boy = df %>% filter(str_detect(sentences, 'boy')) %>% mutate(w = factor("boy"))
girl = df %>% filter(str_detect(sentences, 'girl')) %>% mutate(w = factor("girl"))
tags <- bind_rows(old, new, boy, girl)

So I want to choose a limited number of words, for example:

tags <- c('bananas', 'apples', 'oranges')

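I would like the result to be a data.frame with a new column for each word I chose. If a row contains one of those words, that word's column should be TRUE, or the row should be flagged in some other way. Something like: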

Sentences     bananas     apples     oranges  
sentence1     TRUE        
sentence2                 TRUE
sentence3     TRUE
sentence4                            TRUE
sentence5                 TRUE       TRUE

Or:

Sentences     tag1        tag2
sentence1     bananas        
sentence2     apples
sentence3     bananas
sentence4     oranges
sentence5     apples      oranges

Or something similar. Please let me know if I can explain this more clearly.

Answer 1

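Do you really want to use the apply functions? I am pretty sure the tm package is what you are looking for. It is the simplest and most robust way. The DocumentTermMatrix function gives you what you need. I made up a few sentences myself (of a very high grammatical standard). The easiest approach is to process all the words and, once you have the matrix, select the columns for the words you want to find.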

sentence1 <- "This is a banana"
sentence2 <- "This is an apple"
sentence3 <- "This is a watermelon and a banana"
sentence4 <- "This is a watermelon a banana an apple"

df_sentence <- rbind(sentence1, sentence2, sentence3, sentence4)

library(tm)
vs_sentence <- VectorSource(df_sentence)
vc_sentence <- VCorpus(vs_sentence)

clean_sentence <- tm_map(vc_sentence, removePunctuation)
dtm_sentence <- DocumentTermMatrix(clean_sentence)
as.matrix(dtm_sentence)

Result:

        Terms
Docs and apple banana this watermelon
   1   0     0      1    1          0
   2   0     1      0    1          0
   3   1     0      1    1          1
   4   0     1      1    1          1
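
The column selection mentioned above could look roughly like the sketch below. It is only an illustration: the `tags` vector and the names `found`, `flags`, and `tagged` are made up for this example, and it assumes the tags are spelled exactly like the terms in the matrix (the matrix holds whole lowercase words, so "bananas" would not match "banana").

```r
library(tm)

sentences <- c("This is a banana",
               "This is an apple",
               "This is a watermelon and a banana",
               "This is a watermelon a banana an apple")

# Same pipeline as above: corpus -> strip punctuation -> document-term matrix
corpus <- VCorpus(VectorSource(sentences))
corpus <- tm_map(corpus, removePunctuation)
m <- as.matrix(DocumentTermMatrix(corpus))

# Hypothetical tag list; keep only the tags that actually occur as terms
tags <- c("banana", "apple", "orange")
found <- intersect(tags, colnames(m))

# Counts greater than zero become TRUE, giving one logical column per tag
flags <- m[, found, drop = FALSE] > 0
tagged <- data.frame(sentence = sentences, flags, row.names = NULL)
tagged
```

Tags that never appear (here "orange") are simply dropped by `intersect`, so no all-FALSE column is created for them.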

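There is also another function, TermDocumentMatrix, that gives you the transposed layout, with terms as rows and documents as columns: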

as.matrix(TermDocumentMatrix(clean_sentence))
            Docs
Terms        1 2 3 4
  and        0 0 1 0
  apple      0 1 0 1
  banana     1 0 1 1
  this       1 1 1 1
  watermelon 0 0 1 1
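
If only a handful of words matter, a plain pattern match may be enough to produce the TRUE/FALSE layout from the question without building a full term matrix. The following is a minimal sketch in base R, not the tm approach described above; the example sentences and the names `flags` and `result` are invented for illustration. Note that it matches substrings, so "apples" would also be found inside "pineapples".

```r
# Made-up example data; in the question this would be the sentences column of df
df <- data.frame(
  sentences = c("I like bananas", "apples are red", "bananas again",
                "oranges only", "apples and oranges"),
  stringsAsFactors = FALSE
)
tags <- c("bananas", "apples", "oranges")

# One logical column per tag: TRUE where the sentence contains the word
flags <- sapply(tags, function(tag) grepl(tag, df$sentences, fixed = TRUE))
result <- cbind(df, flags)

# Alternative layout: all matching tags collapsed into one string per row
result$tags <- apply(flags, 1, function(row) paste(tags[row], collapse = ", "))
result
```

The logical columns correspond to the first layout in the question, and the collapsed `tags` column is a compact variant of the second.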

If you could provide a sample of your sentences, it would be easier to offer you a better solution. HTH!
