Search for the first text matching a dictionary term in R

Time: 2016-09-16 06:38:57

Tags: r regex match string-matching stringr

I have a dictionary of terms:
terms <- c("hello world", "great job")
terms <- as.data.frame(terms)

I want to search another data.frame, which contains documents, for the first text that matches each term:
doc <- c("i would like to say hello worlds", "hey friends hello world everyone", "i'm looking for a great job", "great job")
docs <- as.data.frame(doc)

Expected result:

foundtext <- c("i would like to say hello worlds","i'm looking for a great job")
output <- cbind(terms, foundtext)

Thanks for your help!

1 Answer:

Answer 0 (score: 0)

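This solution is simple and works for the example data. As I said, it does not rely on regular-expression features; each term is searched for as a plain substring.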

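doc <- c("i would like to say hello worlds", "hey friends hello world everyone", "i'm looking for a great job", "great job")
docs <- data.frame(doc, stringsAsFactors = FALSE)
docs$match <- "not found"  # placeholder for documents that match no term
# Loop over the term strings; `for (i in terms)` would loop over the
# columns of the data.frame and pass a whole vector to grepl() as the pattern.
for (term in as.character(terms$terms)) {
    # fixed = TRUE matches the term literally instead of as a regex
    hit <- grepl(term, docs$doc, fixed = TRUE)
    docs$match[hit] <- term
}
# Keep only the first document for each term and drop unmatched documents
docs <- docs[docs$match != "not found" & !duplicated(docs$match), ]
docs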
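
If the result should have the shape asked for in the question (one row per term, with the first matching text next to it), the same idea can be sketched without the loop. This is only a minimal sketch in base R: `first_match` is a hypothetical helper name introduced here, not part of any package, and it assumes the terms should be matched as literal substrings.

```r
terms <- c("hello world", "great job")
doc <- c("i would like to say hello worlds",
         "hey friends hello world everyone",
         "i'm looking for a great job",
         "great job")

# Hypothetical helper: return the first text that contains `term` as a
# literal substring, or NA when no text contains it.
first_match <- function(term, texts) {
  hits <- texts[grepl(term, texts, fixed = TRUE)]
  if (length(hits) > 0) hits[1] else NA_character_
}

# One lookup per term; vapply() guarantees a character vector back
foundtext <- vapply(terms, first_match, character(1),
                    texts = doc, USE.NAMES = FALSE)
output <- data.frame(terms, foundtext, stringsAsFactors = FALSE)
output
```

Returning NA for terms with no match keeps `output` aligned with `terms`, so a missing term shows up as an empty cell rather than silently shifting the rows.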