Question

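Continuing my exploration of text analysis, I have hit another roadblock. I understand the logic but don't know how to do it in R. Here is what I want to do: I have 2 CSVs, one containing 10,000 comments and one containing a list of words. I want to select all the comments that contain any of the words in the second CSV. How can I go about it?

Example: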

**CSV 1:**
this is a sample set
the comments are not real
this is a random set of words
hope this helps the problem case
thankyou for helping out
i have learned a lot here
feel free to comment

**CSV 2:**
sample
set
comment

**Expected output:**
 this is a sample set
 the comments are not real
 this is a random set of words
 feel free to comment

Please note: different forms of a word also count as a match, e.g. both "comment" and "comments" are considered.

Answer 1

We can use grep after pasting the elements of the second dataset together into a single pattern.

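# Read the word list as a character vector, and the comments one per line
v1 <- scan("file2.csv", what = "")
lines1 <- readLines("file1.csv")
# Collapse the words into "sample|set|comment" and keep the matching lines
grep(paste(v1, collapse = "|"), lines1, value = TRUE)
#[1] "this is a sample set"          "the comments are not real"
#[3] "this is a random set of words" "feel free to comment"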
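To make the pattern that paste builds visible, here is a minimal sketch that uses inline vectors in place of the two files. The second half is an assumption on my part, not something from the answer above: it wraps the alternation in word boundaries with an optional plural "s", so that "set" no longer matches inside a word such as "reset". The name `strict` and the extra test sentence are made up for illustration.

```r
# Inline stand-ins for file1.csv and file2.csv so the sketch runs on its own.
lines1 <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment"
)
v1 <- c("sample", "set", "comment")

# paste() joins the words into one alternation pattern.
pattern <- paste(v1, collapse = "|")
print(pattern)
hits <- grep(pattern, lines1, value = TRUE)

# Assumed stricter variant: whole words only, optionally followed by "s".
strict <- paste0("\\b(", pattern, ")s?\\b")
strict.hits <- grep(strict, c(lines1, "please reset the counter"),
                    value = TRUE, perl = TRUE)
print(strict.hits)
```

Whether the looser or the stricter pattern is the right one depends on how much substring matching you are willing to accept in the 10,000 comments.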

Answer 2

First create two objects called lines and words.to.match from your files. You could do it like this:

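# header=FALSE so the first row is kept as data (the sample files have no header row)
lines <- read.csv('csv1.csv', header=FALSE, stringsAsFactors=FALSE)[[1]]
words.to.match <- read.csv('csv2.csv', header=FALSE, stringsAsFactors=FALSE)[[1]]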

Let's say they look like this:

lines <- c(
  'this is a sample set',
  'the comments are not real',
  'this is a random set of words',
  'hope this helps the problem case',
  'thankyou for helping out',
  'i have learned a lot here',
  'feel free to comment'
)
words.to.match <- c('sample', 'set', 'comment')

Then you can compute the matches with two nested *apply functions:

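matches <- mapply(
    function(words, line)
        # TRUE if at least one word occurs in this line (literal substring match)
        any(sapply(words, grepl, line, fixed=TRUE)),
    list(words.to.match),  # length-1 list, recycled for every line
    lines
)
matched.lines <- lines[which(matches)]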

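What is happening here? I use mapply to apply a function to every line in lines, with words.to.match as the other argument. Note that list(words.to.match) has length 1, so this argument is simply recycled for each call. Inside the mapply function, I then call sapply to check whether any of the words matches the line (the match itself is tested with grepl).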
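To see what a single inner step returns, here is a small sketch using the sample data from above. The vapply rewrite in the second half is my own variation rather than part of the original answer; it is meant to follow the same nested logic while declaring the result type explicitly. The name `one.line` is made up for illustration.

```r
lines <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment"
)
words.to.match <- c("sample", "set", "comment")

# One inner step: test every word against a single line.
# The result is a logical vector named after the words.
one.line <- sapply(words.to.match, grepl, "the comments are not real", fixed = TRUE)
print(one.line)

# Assumed equivalent of the nested mapply/sapply call, written with vapply:
# outer loop over lines, inner loop over words, any() collapses the inner result.
matches <- vapply(
  lines,
  function(line) any(vapply(words.to.match, grepl, logical(1), line, fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)
print(lines[matches])
```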

The nested-apply version is not necessarily the most efficient solution, but I find it easier to understand. Another way to compute matches is:

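# One logical vector per word, each as long as lines
matches <- lapply(words.to.match, grepl, lines, fixed=TRUE)
# Stack them into a words-by-lines matrix
matches <- do.call("rbind", matches)
# A line matches if any word (any row) is TRUE in its column
matches <- apply(matches, 2, any)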

I'm not a big fan of this solution, because you need the do.call("rbind", ...) step, which is a bit hacky.
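If the do.call step is the only objection, one possible way around it (my suggestion, not part of the answer above) is to let sapply build the matrix directly. This sketch assumes the same sample data; since every grepl call returns a vector as long as lines, sapply should simplify the results into a lines-by-words matrix, and rowSums can then flag the lines with at least one hit. The name `m` is made up for illustration.

```r
lines <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment"
)
words.to.match <- c("sample", "set", "comment")

# One column per word, one row per line; no rbind needed.
m <- sapply(words.to.match, grepl, lines, fixed = TRUE)
print(dim(m))

# A line matches if at least one of its columns is TRUE.
matches <- rowSums(m) > 0
print(lines[matches])
```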

Extracting only the relevant comments from a list of comments
