Question

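I have a data frame with a concatenated string column whose last 11 digits are a census tract. I have a separate list of strings in which the last 2 or 5 digits represent a state or a county, respectively, and I have appended a * to the end of those 2 or 5 digits. I need to go through the data frame and flag whether the trans variable (census tract) is in the patterns vector (state or county), letting the * stand for the remaining 9 or 6 digits in trans.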

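As the code below shows, I got this working by collapsing all the patterns into a single string with collapse="|" and running grepl on that string against trans. However, I am wondering whether I can do this with vector operations instead, because 1) it feels like I should be able to, and 2) in practice the list of patterns is huge, and stuffing them all into a single character variable feels silly.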

Is there something similar to the %in% operator that supports regular expressions/wildcards?

library(dplyr)

trans <- c("1-IA-45045000100",
           "2-IA-23003001801",
           "3-LITP-01001000100",
           "4-OTP-06006000606",
           "4-OTP-06010001001",
           "1-IA-45001010002",
           "2-IA-45045000101",
           "2-LITP-23005005002")
df <- data.frame(id = 1:8, trans)

patterns <- c("1-IA-45*",
              "2-LITP-23005*",
              "4-OTP-06*")

# This works, but I'm looking for a better way
patterns_string <- paste(patterns, collapse="|")
df <- df %>% mutate(match = ifelse(grepl(patterns_string, df$trans), TRUE, FALSE))

# Is there any way to keep the patterns in a vector and check whether any of
# them match (via grepl) each row of my data frame, or to use %in% with a
# wildcard character?

# "argument 'pattern' has length > 1 and only first element will be used" 
df <- df %>% mutate(match = ifelse(grepl(patterns, df$trans), TRUE, FALSE))

# Can't take advantage of the wildcard character '*'
df <- df %>% mutate(match = trans %in% patterns)

Answer 1

You can run each pattern through grepl() with lapply(), then merge the results with Reduce() and the logical "or" operator.
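
A minimal sketch of that approach in base R, reusing the trans and patterns vectors from the question. The intermediate name hits is only for illustration, and the patterns are still passed to grepl() as regular expressions, exactly as in the question's own code.

```r
trans <- c("1-IA-45045000100", "2-IA-23003001801", "3-LITP-01001000100",
           "4-OTP-06006000606", "4-OTP-06010001001", "1-IA-45001010002",
           "2-IA-45045000101", "2-LITP-23005005002")
df <- data.frame(id = 1:8, trans)

patterns <- c("1-IA-45*", "2-LITP-23005*", "4-OTP-06*")

# One logical vector per pattern: does that pattern match each element of trans?
hits <- lapply(patterns, grepl, x = df$trans)

# Combine the per-pattern vectors with an elementwise OR
df$match <- Reduce(`|`, hits)
```

The patterns stay in a vector the whole time, so nothing has to be pasted into one long alternation string.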

Answer 2

Here is a tidyverse option using stri_detect from stringi:

library(stringi)
library(tidyverse)
patterns %>%
      map(~stri_detect_regex(df$trans, .)) %>% 
      reduce(`|`) %>%
      mutate(df, match = .)
#  id              trans match
#1  1   1-IA-45045000100  TRUE
#2  2   2-IA-23003001801 FALSE
#3  3 3-LITP-01001000100 FALSE
#4  4  4-OTP-06006000606  TRUE
#5  5  4-OTP-06010001001  TRUE
#6  6   1-IA-45001010002  TRUE
#7  7   2-IA-45045000101 FALSE
#8  8 2-LITP-23005005002  TRUE
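
A side note on the * in these patterns, offered as a sketch rather than as part of the answer above. When a pattern such as "1-IA-45*" is used as a regular expression, the * means "zero or more of the preceding character", so it would also match a string containing "1-IA-41". If the * is meant as a shell-style wildcard, one possibility is to convert the patterns with glob2rx() from the utils package first. The operator name %glob_in% below is hypothetical and is not an existing R operator.

```r
# Hypothetical %in%-like operator: TRUE where x matches at least one glob pattern
`%glob_in%` <- function(x, globs) {
  rx <- utils::glob2rx(globs)            # e.g. "1-IA-45*" becomes "^1-IA-45"
  hits <- lapply(rx, grepl, x = x)       # one logical vector per pattern
  # Start from all FALSE so an empty pattern vector still returns a logical vector
  Reduce(`|`, hits, rep(FALSE, length(x)))
}

trans <- c("1-IA-45045000100", "2-IA-23003001801", "3-LITP-01001000100",
           "4-OTP-06006000606", "4-OTP-06010001001", "1-IA-45001010002",
           "2-IA-45045000101", "2-LITP-23005005002")
patterns <- c("1-IA-45*", "2-LITP-23005*", "4-OTP-06*")

match <- trans %glob_in% patterns
```

On the sample data this gives the same flags as the regex version, but the converted patterns are anchored at the start of the string and treat the digits before the * literally.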

R: %in% operator with wildcards / regex
