I have a data frame of the following form:
Column1 = c('Elephant,Starship Enterprise,Cat','Random word','Word','Some more words, Even more words')
Column2=c('Rat,Starship Enterprise,Elephant','Ocean','No','more')
d1 = data.frame(Column1,Column2)
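What I would like to do is find and count the exact matches between the words in Column1 and Column2. Each column can hold multiple words, separated by commas.
For example, in row 1 there are two words in common: a) Starship Enterprise and b) Elephant. In row 4, however, even though the word "more" appears in both columns, neither of the exact strings (Some more words, Even more words) does, so the count is zero. The expected output would be something like this.
Any help would be greatly appreciated.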
Answer 0 (score: 2)
Split each column on commas and count the size of the intersection of the resulting words:
mapply(function(x, y) length(intersect(x, y)),
strsplit(d1$Column1, ","), strsplit(d1$Column2, ","))
#[1] 2 0 0 0
Or the tidyverse way:
library(tidyverse)
d1 %>%
mutate(Common = map2_dbl(Column1, Column2, ~
length(intersect(str_split(.x, ",")[[1]], str_split(.y, ",")[[1]]))))
#                           Column1                          Column2 Common
#1 Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant      2
#2                      Random word                            Ocean      0
#3                             Word                               No      0
#4 Some more words, Even more words                             more      0
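Both versions above compare the pieces exactly as they come out of the split, so an entry written with a space after the comma (as in row 4, " Even more words") keeps its leading space and would not match the same text without it. If that matters for your data, a minimal base R sketch that trims whitespace before intersecting could look like the following. The helper name count_common is made up for illustration, and stringsAsFactors = FALSE is set on the assumption that the columns should be character vectors (the default only from R 4.0.0 onward).

```r
# Hypothetical helper (not part of the answer above): split two cells on
# commas, trim surrounding whitespace, and count the entries they share.
count_common <- function(x, y) {
  split_trim <- function(s) {
    trimws(strsplit(as.character(s), ",", fixed = TRUE)[[1]])
  }
  length(intersect(split_trim(x), split_trim(y)))
}

# Same data as in the question, built as character columns.
Column1 <- c("Elephant,Starship Enterprise,Cat", "Random word", "Word",
             "Some more words, Even more words")
Column2 <- c("Rat,Starship Enterprise,Elephant", "Ocean", "No", "more")
d1 <- data.frame(Column1, Column2, stringsAsFactors = FALSE)

# Apply the helper row by row; USE.NAMES = FALSE keeps the result unnamed.
d1$Common <- mapply(count_common, d1$Column1, d1$Column2, USE.NAMES = FALSE)
d1$Common
#[1] 2 0 0 0
```

The as.character() call is only there so the helper also accepts factor columns, which strsplit() would otherwise reject.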
Answer 1 (score: 1)
We can use cSplit from the splitstackshape package:
library(splitstackshape)
library(data.table)
v1 <- cSplit(setDT(d1, keep.rownames = TRUE), 2:3, ",", "long")[,
length(intersect(na.omit(Column1), na.omit(Column2))), rn]$V1
d1[, Common := v1][, rn := NULL][]
#                            Column1                          Column2 Common
#1: Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant      2
#2:                      Random word                            Ocean      0
#3:                             Word                               No      0
#4: Some more words, Even more words                             more      0