Question

I have a data frame of the following form:

Column1 = c('Elephant,Starship Enterprise,Cat','Random word','Word','Some more words, Even more words')
Column2=c('Rat,Starship Enterprise,Elephant','Ocean','No','more')
d1 = data.frame(Column1,Column2)

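What I want to do is find and count the exact matches between the words in Column1 and Column2. Each column can hold several words, separated by commas.

For example, in row 1 there are two common entries: a) Starship Enterprise and b) Elephant. In row 4, however, although the word "more" appears in both columns, the exact strings (Some more words, Even more words) do not. The expected output would look like this.

Any help would be much appreciated.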

Answer 1

Split each column on commas and count the elements in the intersection:

mapply(function(x, y) length(intersect(x, y)), 
        strsplit(d1$Column1, ","), strsplit(d1$Column2, ","))
#[1] 2 0 0 0
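
To see what the one-liner does for a single row, and to recover the matched words themselves rather than only their count, the steps can be unrolled. This is a minimal sketch using the row 1 values from the question; the variable names a, b and common are illustrative and not part of the original answer.

```r
# Split each comma-separated string into a character vector.
a <- strsplit("Elephant,Starship Enterprise,Cat", ",", fixed = TRUE)[[1]]
b <- strsplit("Rat,Starship Enterprise,Elephant", ",", fixed = TRUE)[[1]]

# intersect() keeps the elements present in both vectors (exact matches only),
# in the order they appear in the first argument.
common <- intersect(a, b)
common
# "Elephant" "Starship Enterprise"

# The count reported by the answer is simply the length of that vector.
length(common)
# 2
```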

The same count, the tidyverse way:

library(tidyverse)
d1 %>%
  mutate(Common = map2_dbl(Column1, Column2, ~ 
      length(intersect(str_split(.x, ",")[[1]], str_split(.y, ",")[[1]]))))


#                           Column1                          Column2 Common
#1 Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant      2
#2                      Random word                            Ocean      0
#3                             Word                               No      0
#4 Some more words, Even more words                             more      0
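
One caveat: both versions above split on a bare comma, so an entry such as " Even more words" in row 4 keeps its leading space and would not match "Even more words" in the other column. If the data may contain spaces after commas, a variant that trims whitespace before comparing could look like the sketch below. The helper name count_common is hypothetical, and the as.character() call is an added assumption to guard against factor columns (the default in data.frame() before R 4.0).

```r
# Hypothetical helper (not from the original answer): count exact matches
# between two comma-separated strings after trimming surrounding whitespace.
count_common <- function(x, y) {
  split_trim <- function(s) {
    trimws(strsplit(as.character(s), ",", fixed = TRUE)[[1]])
  }
  length(intersect(split_trim(x), split_trim(y)))
}

# Data from the question.
Column1 <- c('Elephant,Starship Enterprise,Cat', 'Random word', 'Word',
             'Some more words, Even more words')
Column2 <- c('Rat,Starship Enterprise,Elephant', 'Ocean', 'No', 'more')
d1 <- data.frame(Column1, Column2, stringsAsFactors = FALSE)

# Apply the helper row by row; USE.NAMES = FALSE keeps the result unnamed.
d1$Common <- mapply(count_common, d1$Column1, d1$Column2, USE.NAMES = FALSE)
d1$Common
# 2 0 0 0
```

On the question's data the result is unchanged, since no entry differs from its counterpart only by whitespace.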

Answer 2

We can use cSplit from the splitstackshape package:

library(splitstackshape)
library(data.table)
v1 <- cSplit(setDT(d1, keep.rownames = TRUE), 2:3, ",", "long")[, 
    length(intersect(na.omit(Column1), na.omit(Column2))), rn]$V1
d1[, Common := v1][, rn := NULL][]
#                             Column1                          Column2 Common
#1: Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant      2
#2:                      Random word                            Ocean      0
#3:                             Word                               No      0
#4: Some more words, Even more words                             more      0

R: exactly matching strings between two columns
