Exact string matches between two columns in R

Asked: 2019-04-13 11:50:02

Tags: r string-matching data-manipulation

I have a data frame of the following form:

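Column1 = c('Elephant,Starship Enterprise,Cat', 'Random word', 'Word', 'Some more words, Even more words')
Column2 = c('Rat,Starship Enterprise,Elephant', 'Ocean', 'No', 'more')
# stringsAsFactors = FALSE keeps both columns as character (needed by strsplit() in R < 4.0)
d1 = data.frame(Column1, Column2, stringsAsFactors = FALSE)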

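What I want to do is find and count the exact matches between the words in Column1 and Column2. Each column can contain multiple words, separated by commas.

For example, in row 1 there are two words in common: a) Starship Enterprise and b) Elephant. In row 4, however, even though the word "more" appears in both columns, neither exact string (Some more words, Even more words) does. The expected output adds a count of exact matches for each row: 2, 0, 0, 0.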

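Any help would be greatly appreciated.

2 Answers:

Answer 0 (score: 2)

Split each column on commas and count the intersection of the resulting words: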

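# Split every cell on "," and count the words shared by each row's pair of cells.
# Assumes both columns are character vectors, not factors.
mapply(function(x, y) length(intersect(x, y)), 
        strsplit(d1$Column1, ","), strsplit(d1$Column2, ","))
#[1] 2 0 0 0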
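
One caveat with splitting on a bare comma: a space after the comma stays attached to the word, so " Even more words" would not equal "Even more words". A minimal sketch of a whitespace-tolerant variant follows. The helper name `count_common` is hypothetical (it is not part of the answer above), and the sketch assumes surrounding whitespace should be ignored while the comparison stays case-sensitive.

```r
# Hypothetical helper: count exact matches between two comma-separated
# strings, ignoring whitespace around each item.
count_common <- function(x, y) {
  split_items <- function(s) {
    # as.character() guards against factor input; trimws() drops padding
    trimws(strsplit(as.character(s), ",", fixed = TRUE)[[1]])
  }
  length(intersect(split_items(x), split_items(y)))
}

Column1 <- c("Elephant,Starship Enterprise,Cat", "Random word", "Word",
             "Some more words, Even more words")
Column2 <- c("Rat,Starship Enterprise,Elephant", "Ocean", "No", "more")

# Apply row by row; USE.NAMES = FALSE returns a plain unnamed vector
common <- mapply(count_common, Column1, Column2, USE.NAMES = FALSE)
common
#[1] 2 0 0 0
```

For this data the result is the same as above; the trimming only matters when a word that follows ", " in one column has to match an unpadded word in the other.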

Or, the tidyverse way:

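library(tidyverse)  # loads dplyr (mutate), purrr (map2_dbl) and stringr (str_split)
# For each row, split both cells on "," and count the words they share
d1 %>%
  mutate(Common = map2_dbl(Column1, Column2, ~ 
      length(intersect(str_split(.x, ",")[[1]], str_split(.y, ",")[[1]]))))
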

#                           Column1                          Column2 Common
#1 Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant      2
#2                      Random word                            Ocean      0
#3                             Word                               No      0
#4 Some more words, Even more words                             more      0
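
If the matched words themselves are needed and not only the count, the same split-and-intersect step can return them. This is a minimal base R sketch under that assumption; the column name `Matches` is chosen here purely for illustration.

```r
d1 <- data.frame(
  Column1 = c("Elephant,Starship Enterprise,Cat", "Random word", "Word",
              "Some more words, Even more words"),
  Column2 = c("Rat,Starship Enterprise,Elephant", "Ocean", "No", "more"),
  stringsAsFactors = FALSE  # keep strings as character so strsplit() works
)

# Split each cell on commas, then intersect the two word sets row by row
shared <- Map(intersect, strsplit(d1$Column1, ","), strsplit(d1$Column2, ","))

# Number of exact matches per row
d1$Common <- lengths(shared)
# The matched words, re-joined with commas (empty string when there are none)
d1$Matches <- vapply(shared, paste, character(1), collapse = ",")

d1[, c("Common", "Matches")]
```

Keeping the intersection as a list first means the count and the matched words come from the same computation, so they cannot drift apart.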

Answer 1 (score: 1)

We can use cSplit:

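library(splitstackshape)
library(data.table)
# setDT() converts d1 to a data.table by reference and stores the row names in `rn`.
# cSplit() then splits columns 2:3 (Column1, Column2) on "," into long format,
# and the intersection is counted within each original row (grouped by rn).
v1 <- cSplit(setDT(d1, keep.rownames = TRUE), 2:3, ",", "long")[, 
    length(intersect(na.omit(Column1), na.omit(Column2))), rn]$V1
# Attach the counts as a new column and drop the helper row-name column
d1[, Common := v1][, rn := NULL][]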
#                             Column1                          Column2 Common
#1: Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant      2
#2:                      Random word                            Ocean      0
#3:                             Word                               No      0
#4: Some more words, Even more words                             more      0