Detecting similar consecutive patterns across two columns

Date: 2019-01-28 21:00:24

Tags: r text expression

x1 <- c("Sunwood", "Greengrass", "bluesky")
x2 <- c("Sun wood", "green", "sky Pl")

testframe <- data.frame(Address1 = x1, Address2 = x2)

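The output, a third column comparing the two columns, should show "Yes", because "Sun", "green", and "sky" indicate matches. How can we detect this (three consecutive letters)?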

1 Answer:

Answer 0: (score: 2)

Here is one tidyverse possibility:

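library(dplyr)
library(stringr)

testframe %>%
 mutate_if(is.factor, as.character) %>%
 # Compare the first three characters of each address, ignoring case
 mutate(Res = ifelse(str_detect(str_extract(Address1, "^.{3}"), 
                          fixed(str_extract(Address2, "^.{3}"), ignore_case = TRUE)), "Yes", "No"))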

    Address1 Address2 Res
1    Sunwood Sun wood Yes
2 Greengrass    green Yes
3    bluesky   sky Pl  No

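It checks whether the first three characters of Address1 match the first three characters of Address2 (regardless of case). If they do, it returns "Yes"; otherwise it returns "No".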
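To make the individual steps visible, here is a minimal base R sketch of the same prefix check on plain character vectors. The helper name `same_prefix` and its `n` argument are illustrative assumptions, not part of the answer's code.

```r
# Hypothetical helper: TRUE where the first n characters of a and b
# are equal, ignoring case. The comparison is element-wise (row by row).
same_prefix <- function(a, b, n = 3) {
  tolower(substr(a, 1, n)) == tolower(substr(b, 1, n))
}

x1 <- c("Sunwood", "Greengrass", "bluesky")
x2 <- c("Sun wood", "green", "sky Pl")

res <- ifelse(same_prefix(x1, x2), "Yes", "No")
print(res)
```

The third pair yields "No" because "blu" and "sky" differ, even though "sky" occurs later in "bluesky".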

Or, converting to lower case manually:

testframe %>%
 mutate_if(is.factor, as.character) %>%
 mutate(Res = ifelse(str_detect(tolower(str_extract(Address1, "^.{3}")), 
                                tolower(str_extract(Address2, "^.{3}"))), "Yes", "No"))

The same, but simplified based on @PoGibas's idea:

testframe %>%
 mutate_if(is.factor, as.character) %>%
 mutate(Res = ifelse(tolower(str_extract(Address1, "^.{3}")) == tolower(str_extract(Address2, "^.{3}")), "Yes", "No"))

Or using only base R:

testframe$Address1 <- as.character(testframe$Address1)  
testframe$Address2 <- as.character(testframe$Address2)

# Compare row by row with ==; %in% would test membership in the whole Address2 column instead
testframe$Res <- ifelse(tolower(sub("^(.{3}).*", "\\1", testframe$Address1)) == 
                         tolower(sub("^(.{3}).*", "\\1", testframe$Address2)), "Yes", "No")

    Address1 Address2 Res
1    Sunwood Sun wood Yes
2 Greengrass    green Yes
3    bluesky   sky Pl  No
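When comparing prefixes in base R, it may help to keep in mind that `==` and `%in%` answer different questions: `==` compares row by row, while `%in%` asks whether a value occurs anywhere in the other vector. The small sketch below uses made-up vectors `a` and `b` (not the question's data) with the rows deliberately swapped to show the difference.

```r
a <- c("Sunwood", "Greengrass")
b <- c("green", "Sun wood")   # same prefixes as a, but in swapped rows

pa <- tolower(substring(a, 1, 3))
pb <- tolower(substring(b, 1, 3))

rowwise  <- pa == pb     # row 1 against row 1, row 2 against row 2
anywhere <- pa %in% pb   # does each prefix of a occur in any row of b?

print(rowwise)
print(anywhere)
```

Here `rowwise` is all `FALSE` while `anywhere` is all `TRUE`, so the two only agree when matching prefixes happen to sit in the same row.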

Or essentially the same, using @PoGibas's idea:

# Compare row by row with ==; %in% would test membership in the whole Address2 column instead
testframe$Res <- ifelse(tolower(substring(testframe$Address1, 1, 3)) == 
                         tolower(substring(testframe$Address2, 1, 3)), "Yes", "No")
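All of the variants above compare only the first three characters, which is why the third row is "No" although both addresses contain "sky". If a match anywhere in the string is wanted, one possible approach is to compare every three-character run of both strings. The following is a rough sketch rather than part of the original answer; the helper names `ngrams` and `shares_run` are hypothetical, and it assumes character input without `NA` values.

```r
# Hypothetical helper: all lower-cased substrings of length n in a single string s.
ngrams <- function(s, n = 3) {
  s <- tolower(s)
  len <- nchar(s)
  if (len < n) return(character(0))
  starts <- seq_len(len - n + 1)
  substring(s, starts, starts + n - 1)
}

# TRUE for each pair (a[i], b[i]) that shares at least one run of n characters.
shares_run <- function(a, b, n = 3) {
  mapply(function(u, v) any(ngrams(u, n) %in% ngrams(v, n)),
         a, b, USE.NAMES = FALSE)
}

x1 <- c("Sunwood", "Greengrass", "bluesky")
x2 <- c("Sun wood", "green", "sky Pl")

res <- ifelse(shares_run(x1, x2), "Yes", "No")
print(res)  # "Yes" "Yes" "Yes"
```

Note that this sketch treats spaces like any other character, so a run such as "un " counts as a candidate; stripping non-letters first would be a reasonable refinement if only letters should match.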