Finding matches between two string columns in R

Date: 2019-09-23 14:18:47

Tags: r stringr

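To solve a tag-migration problem, I have to compare two character columns and check whether the tags in them overlap.

In short, given a data frame like this: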

old_tags            new_tags
burger              burger, american
italian, pizza      italian
latin, peruvian     peruvian, latin
french              pizza

I would like to add a third column, like this:

old_tags            new_tags            match
burger              burger, american    TRUE
italian, pizza      italian             TRUE
latin, peruvian     peruvian, latin     TRUE
french              pizza               FALSE

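So far I have tried functions such as str_match and str_detect without success. They often return TRUE for pairs of strings that should actually be FALSE, such as the example I entered in [3,].

Thank you very much.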

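3 answers:

Answer 0: (score: 2)

One base R approach is to split the strings on the comma separator, use Map with intersect to find the tags common to both columns in each row, and return TRUE when at least one tag intersects.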

df$match <- lengths(Map(intersect, strsplit(df$old_tags, ", "), 
                    strsplit(df$new_tags, ", "))) > 0

df
#         old_tags         new_tags match
#1          burger burger, american  TRUE
#2  italian, pizza          italian  TRUE
#3 latin, peruvian  peruvian, latin  TRUE
#4          french            pizza FALSE
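
The one-liner above can be unpacked into its individual steps. This is only a sketch for illustration: the intermediate names old_list, new_list and common are hypothetical, and fixed = TRUE is an added assumption so that the separator is treated as a literal string rather than a regular expression.

```r
# Same example data as in the answer
df <- data.frame(
  old_tags = c("burger", "italian, pizza", "latin, peruvian", "french"),
  new_tags = c("burger, american", "italian", "peruvian, latin", "pizza"),
  stringsAsFactors = FALSE
)

# Step 1: split each cell into a character vector of tags
old_list <- strsplit(df$old_tags, ", ", fixed = TRUE)
new_list <- strsplit(df$new_tags, ", ", fixed = TRUE)

# Step 2: row by row, keep only the tags present on both sides
common <- Map(intersect, old_list, new_list)

# Step 3: a row matches when at least one tag survived the intersection
df$match <- lengths(common) > 0
```

Because intersect compares whole tags, the order of the tags within a cell does not matter, which is why row 3 still matches.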

Data

df <- structure(list(old_tags = c("burger", "italian, pizza", "latin, peruvian", 
"french"), new_tags = c("burger, american", "italian", "peruvian, latin", 
"pizza")), row.names = c(NA, -4L), class = "data.frame")

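Answer 1: (score: 1)

A tidyverse-based possibility:

library(dplyr)
library(purrr)    # map_chr() comes from purrr, not from dplyr or stringr
library(stringr)

df %>% 
   # Turn "italian, pizza" into the regex alternation "italian|pizza"
   mutate(patterns = map_chr(strsplit(old_tags, ", "), paste, collapse = "|"),
          Match = str_detect(new_tags, patterns)) %>% 
   select(-patterns)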
         old_tags         new_tags Match
1          burger burger, american  TRUE
2  italian, pizza          italian  TRUE
3 latin, peruvian  peruvian, latin  TRUE
4          french            pizza FALSE
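
One caveat with a regex alternation is that it also matches inside longer words, which may be the kind of false TRUE described in the question. The sketch below uses hypothetical tags (not from the original data) to show the effect, and wraps the pattern in \\b word boundaries as one possible remedy. It assumes the tags contain no regex metacharacters.

```r
library(stringr)

# Hypothetical tags chosen to expose the substring problem
old_tags <- c("thai, sushi", "bar")
new_tags <- c("thailand street food", "barbecue")

# Build one alternation per row, e.g. "thai|sushi"
patterns <- vapply(strsplit(old_tags, ", ", fixed = TRUE),
                   paste, character(1), collapse = "|")

# Plain alternation: "thai" is found inside "thailand", "bar" inside "barbecue"
plain <- str_detect(new_tags, patterns)

# Word boundaries around the group only accept whole words
bounded <- str_detect(new_tags, paste0("\\b(", patterns, ")\\b"))
```

Here plain is TRUE for both rows while bounded is FALSE for both, since neither new tag contains an old tag as a whole word.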

Answer 2: (score: 0)

Or we can use str_extract_all together with any:

library(tidyverse)
df %>% 
   mutate(match = map2_lgl(str_extract_all(old_tags, "\\w+"), 
               str_extract_all(new_tags, "\\w+"),  ~ any(.x %in% .y)))
#         old_tags         new_tags match
#1          burger burger, american  TRUE
#2  italian, pizza          italian  TRUE
#3 latin, peruvian  peruvian, latin  TRUE
#4          french            pizza FALSE
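
Note that "\\w+" extracts individual words, so a tag made of several words is compared word by word. If the real data contains such tags, splitting on the comma may be closer to the intent. The following is a sketch with hypothetical tags (not from the original data); the names by_word and by_tag are illustrative only.

```r
library(stringr)
library(purrr)

# Hypothetical data with a multi-word tag in the first row
old_tags <- c("fast food", "latin, peruvian")
new_tags <- c("street food", "peruvian, latin")

# Word-level comparison: "fast food" and "street food" share the word "food"
by_word <- map2_lgl(str_extract_all(old_tags, "\\w+"),
                    str_extract_all(new_tags, "\\w+"),
                    ~ any(.x %in% .y))

# Tag-level comparison: split on the comma so each tag stays intact
by_tag <- map2_lgl(str_split(old_tags, ",\\s*"),
                   str_split(new_tags, ",\\s*"),
                   ~ any(.x %in% .y))
```

For the first row by_word is TRUE but by_tag is FALSE. Both versions agree on the second row, and on the single-word tags in the question's data they give the same result.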

Data

df <- structure(list(old_tags = c("burger", "italian, pizza", "latin, peruvian", 
"french"), new_tags = c("burger, american", "italian", "peruvian, latin", 
"pizza")), row.names = c(NA, -4L), class = "data.frame")