How to compare the number of matching statements between two string vectors

Time: 2017-09-29 10:51:19

Tags: r

I want to compare two string vectors as follows:

Test1<-c("Everything is normal","It is all sunny","Its raining cats and dogs","Mild")

Test2<-c("Everything is normal","It is thundering","Its raining cats and dogs","Cloudy")

Filtered<-data.frame(Test1,Test2)

Expected output:

Number the same: 2
Number present in Test1 and not in Test2: 2
Number present in Test2 and not in Test1: 2

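I would also like to see which strings differ, so the other expected output should look as follows (and also be part of the original data frame):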

Same<-c("Everything is normal","Its raining cats and dogs")
OnlyInA<-c("It is all sunny","Mild")
OnlyInB<-c("It is thundering","Cloudy")

I tried:

Filtered$Same<-intersect(Filtered$Test1,Filtered$Test2)
Filtered$InAButNotB<-setdiff(Filtered$Test1,Filtered$Test2)

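But when I run the last line, I get the error "replacement has 127 rows, data has 400" (when I use a longer dataset).

I think this is because only the rows that differ are returned, so the column lengths do not match. How can I put NA in the setdiff result for rows that have no difference, so that I can keep it in the original data frame?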

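2 Answers:

Answer 0: (score: 1)

The base R outer function applies a function to every combination of elements from two vectors. Using outer with '==' therefore compares each element of one vector against each element of the other: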

Test1<-c("Everything is normal","It is all sunny","Its raining cats and dogs")
Test2<-c("Everything is normal","It is thundering","Its raining cats and dogs","Cloudy")

# test each element in Test1 for equality with each element in Test2
compare <- outer(Test1, Test2, '==') 

# calculate overlaps and uniques
overlaps <- sum(compare) # number of overlaps: 2
unique.test1 <- (rowSums(compare) == 0) # in Test1 but not Test2
unique.test2 <- (colSums(compare) == 0) # in Test2 but not Test1

# return the unique and the shared elements
OnlyInA <- Test1[unique.test1]
OnlyInB <- Test2[unique.test2]
same <- Test1[rowSums(compare) > 0] # in both; > 0 also covers duplicates in Test2

# counts
n.unique.a <- sum(unique.test1)
n.unique.b <- sum(unique.test2)
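
One caveat with the outer approach: sum(compare) counts matching pairs, so duplicated strings can inflate it. A minimal sketch, using hypothetical vectors a and b that are not from the original post:

```r
# Hypothetical vectors containing a duplicated string (not from the original post)
a <- c("x", "y", "x")
b <- c("x", "z", "x")

# 3 x 3 logical matrix of pairwise comparisons
cmp <- outer(a, b, "==")

n_pairs  <- sum(cmp)                 # every matching (a, b) pair is counted
n_a_in_b <- sum(rowSums(cmp) > 0)    # elements of a with at least one match in b
n_shared <- length(intersect(a, b))  # distinct strings present in both vectors
```

If the vectors may contain duplicates, it is probably safer to decide first which of these three counts is actually wanted.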

Alternatively, the %in% operator also works well for this kind of task:

Test1[Test1 %in% Test2]
[1] "Everything is normal"      "Its raining cats and dogs"

Test1[!(Test1 %in% Test2)]
[1] "It is all sunny"

Test2[!(Test2 %in% Test1)]
[1] "It is thundering" "Cloudy"    
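
The question also asks how to keep the result in the original data frame, with NA for rows that have no difference. One possible sketch, assuming the four-element vectors from the question, combines %in% with ifelse so that every new column has the same length as the data frame. The column name InBButNotA is an illustrative addition; stringsAsFactors = FALSE is set so the columns stay character in R versions before 4.0.

```r
Test1 <- c("Everything is normal", "It is all sunny", "Its raining cats and dogs", "Mild")
Test2 <- c("Everything is normal", "It is thundering", "Its raining cats and dogs", "Cloudy")
Filtered <- data.frame(Test1, Test2, stringsAsFactors = FALSE)

# Each result has one entry per row: the string where the condition holds, NA elsewhere
Filtered$Same       <- ifelse(Filtered$Test1 %in% Filtered$Test2, Filtered$Test1, NA)
Filtered$InAButNotB <- ifelse(Filtered$Test1 %in% Filtered$Test2, NA, Filtered$Test1)
Filtered$InBButNotA <- ifelse(Filtered$Test2 %in% Filtered$Test1, NA, Filtered$Test2)
```

Unlike intersect and setdiff, which return only the matching or differing strings, this keeps a value for every row, which avoids the "replacement has ... rows" error.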

Answer 1: (score: 0)

Using tidyverse functions, you could try something like:

library(dplyr) # provides %>% and summarise()

Filtered %>%
  summarise(comm = sum(Test1 %in% Test2),
            InA = sum(!(Test1 %in% Test2)),
            InB = sum(!(Test2 %in% Test1)))

However, when working with plain vectors, if you are only interested in the aggregate counts, you could also try the following:

length(intersect(Test1,Test2))
length(setdiff(Test1,Test2))
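
To get all three counts from the question in one place, the set functions can be wrapped in a small helper. This is only a sketch: compare_counts and its element names are hypothetical, and it uses the four-element vectors from the question. Note that intersect and setdiff drop duplicates, so these are counts of distinct strings.

```r
# Hypothetical helper; the name and the returned element names are illustrative
compare_counts <- function(a, b) {
  c(same   = length(intersect(a, b)),  # distinct strings in both
    only_a = length(setdiff(a, b)),    # distinct strings only in a
    only_b = length(setdiff(b, a)))    # distinct strings only in b
}

Test1 <- c("Everything is normal", "It is all sunny", "Its raining cats and dogs", "Mild")
Test2 <- c("Everything is normal", "It is thundering", "Its raining cats and dogs", "Cloudy")

counts <- compare_counts(Test1, Test2)
```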