Extract only the relevant comments from a list of comments

Date: 2016-05-25 09:23:50

Tags: r text-mining

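Continuing my exploration of text analysis, I have hit another roadblock. I understand the logic but don't know how to do it in R. Here is what I want to do: I have 2 CSVs: 1. one containing 10,000 comments, 2. one containing a list of words. I want to select all the comments that contain any of the words in the second CSV. How can I do this?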

Example:

**CSV 1:**
this is a sample set
the comments are not real
this is a random set of words
hope this helps the problem case
thankyou for helping out
i have learned a lot here
feel free to comment

**CSV 2**
sample
set
comment

**Expected output:**
 this is a sample set
 the comments are not real
 this is a random set of words
 feel free to comment

Note: different forms of a word should also be considered; for example, both "comment" and "comments" count as matches.

2 answers:

Answer 0: (score: 1)

We can paste the elements of the second dataset together into a single pattern and then use grep:

v1 <- scan("file2.csv", what ="")
lines1 <- readLines("file1.csv")
grep(paste(v1, collapse="|"), lines1, value=TRUE)
#[1] "this is a sample set"          "the comments are not real" 
#[3] "this is a random set of words" "feel free to comment"   
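
To make the pattern visible, here is a minimal self-contained sketch that uses inline vectors in place of the two files. The object names `words`, `comments`, `pattern` and `selected` are illustrative and do not come from the answer.

```r
# Inline stand-ins for file2.csv and file1.csv (data taken from the question)
words <- c("sample", "set", "comment")
comments <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment"
)

# Join the words into one regular expression using alternation
pattern <- paste(words, collapse = "|")
pattern
# [1] "sample|set|comment"

# Keep every comment that contains at least one of the words as a substring
selected <- grep(pattern, comments, value = TRUE)
```

Because the match is on substrings, "comment" also matches "comments", which is what the question asks for. The same behaviour means "set" would match inside a longer word such as "reset", so the pattern may need tightening if that matters for your data.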

Answer 1: (score: 0)

First, create two objects named lines and words.to.match from your files. You could do it like this:

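# header = FALSE assumes the files have no header row, as in the example above;
# with the default header = TRUE the first comment and first word would be dropped
lines <- read.csv('csv1.csv', header = FALSE, stringsAsFactors = FALSE)[[1]]
words.to.match <- read.csv('csv2.csv', header = FALSE, stringsAsFactors = FALSE)[[1]]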

Let's say they look like this:

lines <- c(
  'this is a sample set',
  'the comments are not real',
  'this is a random set of words',
  'hope this helps the problem case',
  'thankyou for helping out',
  'i have learned a lot here',
  'feel free to comment'
)
words.to.match <- c('sample', 'set', 'comment')

Then you can compute the matches using two nested *apply functions:

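# For each line, test whether any of the words occurs in it (literal match)
matches <- mapply(
    function(words, line)
        any(sapply(words, grepl, line, fixed=TRUE)),
    list(words.to.match),  # length-1 list, recycled for every line
    lines
)
# Keep only the lines with at least one match
matched.lines <- lines[which(matches)]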

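What is happening here? I use mapply to evaluate the function for each line in lines, passing words.to.match as the other argument. Note that list(words.to.match) has length 1; I simply recycle this argument on each application. Then, inside the mapply function, I call sapply to check whether any of the words match the line (I check for a match via grepl).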
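
The recycling of the length-1 list can be seen in a small sketch. It counts matching words per line instead of returning a single TRUE/FALSE; the names `some.words`, `some.lines` and `counts` are made up for illustration.

```r
some.words <- c("sample", "set", "comment")
some.lines <- c(
  "this is a sample set",
  "thankyou for helping out",
  "feel free to comment"
)

# mapply walks its arguments in parallel. The list has length 1, so it is
# recycled: every call receives the full word vector plus one line.
counts <- mapply(
  function(words, line) {
    # One TRUE/FALSE per word, then count the TRUEs
    sum(vapply(words, grepl, logical(1), x = line, fixed = TRUE))
  },
  list(some.words),
  some.lines
)
counts
# [1] 2 0 1
```

Passing `some.words` without wrapping it in `list()` would instead pair individual words with individual lines, which is not what is wanted here.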

This is not necessarily the most efficient solution, but I find it easier to understand. Another way to compute matches is:

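# One logical vector per word: which lines contain that word?
matches <- lapply(words.to.match, grepl, lines, fixed=TRUE)
# Stack the vectors into a words-by-lines logical matrix
matches <- do.call("rbind", matches)
# A line matches if any word (any row in its column) matched
matches <- apply(matches, 2, any)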

I don't like this solution because you need a do.call("rbind", ...), which is a bit of a hack.
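
If the do.call("rbind", ...) step feels awkward, one possible alternative (not part of the original answer) is to fold the per-word results together with an element-wise OR using Reduce. The name `per.word` is illustrative; the data is the example from the question.

```r
lines <- c(
  'this is a sample set',
  'the comments are not real',
  'this is a random set of words',
  'hope this helps the problem case',
  'thankyou for helping out',
  'i have learned a lot here',
  'feel free to comment'
)
words.to.match <- c('sample', 'set', 'comment')

# One logical vector per word: which lines contain that word?
per.word <- lapply(words.to.match, grepl, lines, fixed = TRUE)

# Combine the vectors with an element-wise OR, no matrix needed
matches <- Reduce(`|`, per.word)

matched.lines <- lines[matches]
```

This keeps the same literal (fixed = TRUE) matching as the code above and only changes how the per-word results are combined.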