R: Search for the words of one string in another string and return the words that do not match

Time: 2017-05-25 12:31:20

Tags: r regex string datatable

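My data table has two text columns (col1 and col2). Both contain sentences. I want to look up every word of col1 in col2 and return a string made of the words of col1 minus the words that were found in col2. Here is an example: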

            col1                 |         col2             |     output
america, uk have too much money  |   uk, uk money too too   |  america, have much

1 answer:

Answer 0: (score: 1)

Something like this?

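library(data.table)

# Use `=` (not `<-`) so that the columns are actually named col1 and col2
DT <- data.table(col1 = "america, uk have too much money", col2 = "uk, uk  money too too")

# Split on whitespace, or just before any punctuation mark other than an apostrophe
pat <- "(\\s+)|(?!')(?=[[:punct:]])"

# [[1]] takes the tokens of the first row, so this handles the one-row example as written
DT[, output := {
  w1 <- strsplit(col1, pat, perl = TRUE)[[1]]
  w2 <- strsplit(col2, pat, perl = TRUE)[[1]]
  paste(w1[!(w1 %in% w2)], collapse = " ")  # keep the col1 tokens that do not occur in col2
}]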

Note that the result does not keep the comma, though.
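
If the comma should survive, as in the expected output in the question, one possible variant is to delete the col2 words from the col1 string instead of splitting and re-joining it. The following is only a sketch in base R, not part of the original answer: the helper name `drop_matched_words` is hypothetical, and it assumes that a "word" is a run of letters, digits and apostrophes and that matching is case-sensitive.

```r
# Hypothetical helper (not from the original answer): for each pair of strings,
# remove from x every whole word that also occurs in y, keeping x's punctuation.
drop_matched_words <- function(x, y) {
  # Assumption: words are runs of letters, digits and apostrophes
  y_words <- regmatches(y, gregexpr("[[:alnum:]']+", y))
  mapply(function(s, w) {
    w <- unique(w)
    if (length(w) > 0) {
      # Delete every whole-word occurrence of a y word from the x string
      pattern <- paste0("\\b(", paste(w, collapse = "|"), ")\\b")
      s <- gsub(pattern, "", s, perl = TRUE)
    }
    # Collapse the leftover runs of whitespace and trim both ends
    trimws(gsub("\\s+", " ", s))
  }, x, y_words, USE.NAMES = FALSE)
}

col1 <- c("america, uk have too much money", "the cat sat on the mat")
col2 <- c("uk, uk  money too too", "the dog sat")
result <- drop_matched_words(col1, col2)
print(result)
# [1] "america, have much" "cat on mat"
```

Because the helper is vectorised over both arguments, it can be applied to a whole table at once, for example with `DT[, output := drop_matched_words(col1, col2)]`. One limitation of this sketch is that punctuation attached to a removed word stays behind, so "uk, america" would become ", america".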