Question

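I have a vector of several regular expressions. They are so short and so different from each other that it is not worth trying to craft a single regex that catches all of them.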

I also have a data frame with two columns: one contains strings, the other is an ID, and each ID has multiple strings.

What I need to find out is, for each ID, whether at least one of its associated strings matches at least one of the regular expressions.

MWE:

icd10.autol.tr.regex <- c("C91\\.00", "C92\\.00", "D61\\.\\d{1,2}", "C91\\.10", "C92\\.10", 
                          "Q82\\.8", "D76\\.1", "C81\\.\\d{1,2}", "E76\\.0", "C90\\.00", 
                          "C94\\.60", "C85\\.9", "Q78\\.2", "D59\\.5", "D57\\.1", 
                          "D56\\.\\d{1,2}", "D82\\.\\d{1,2}", "C86\\.4", "C93\\.3\\d", 
                          "C91\\.6\\d")

codes.to.check <- data.frame(code=c("E85.3",  "C90.00", "Z45.20", "N08.4",  "Z29.21", 
                                    "Z52.01",  "C79.3",  "Z45.20", "F05.9",  "B99", "A04.7",  
                                    "R63.3"), 
                             id=c(1,1,1,1,1,1,2,2,2,2,2,2))

Here, I would expect a result that looks like

ID  result 
1   TRUE      #because we matched C90.00
2   FALSE     #no match

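If I had the list of possible codes as plain strings, I would simply use %in%. For regular expressions, I tried str_extract from stringr, but it does not seem to accept a vector of patterns in the way I need. I suppose I could write a nested loop around str_extract, but that feels inefficient. Is there a more idiomatic way?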

Answer 1

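We can paste the elements of the regex vector together with | as the separator, use the combined pattern in str_detect, and wrap the call in any to check whether there is a match within each 'id'.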

library(dplyr)
library(stringr)
codes.to.check %>%
    group_by(id) %>%
    summarise(result = any(str_detect(code, str_c(icd10.autol.tr.regex, 
           collapse="|"))))
# A tibble: 2 x 2
#     id result
#  <dbl> <lgl> 
#1     1 TRUE  
#2     2 FALSE
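
To make the collapse step above more concrete, here is a minimal base R sketch of what the combined pattern looks like and how it behaves. The shortened pattern vector, the test strings, and the names `combined` and `anchored` are assumptions for illustration only. The anchored variant is not part of the answer; it is one possible tweak if only whole-code matches should count.

```r
# A shortened subset of the question's patterns, for illustration
patterns <- c("C90\\.00", "C85\\.9", "D61\\.\\d{1,2}")

# Collapsing with "|" yields one alternation: any single alternative may match
combined <- paste(patterns, collapse = "|")

# Hypothetical test strings: an exact code, a longer code, a code with extra text
test_codes <- c("C90.00", "C85.91", "XC90.001")

# Unanchored, each alternative may match anywhere inside the string
unanchored_hits <- grepl(combined, test_codes)
unanchored_hits
# TRUE TRUE TRUE

# Assumption: if only whole-code matches are wanted, group and anchor the pattern
anchored <- paste0("^(", combined, ")$")
anchored_hits <- grepl(anchored, test_codes)
anchored_hits
# TRUE FALSE FALSE
```

Whether partial matches such as "C85.91" should count depends on the coding scheme, so the anchored form is only worth using if they should not.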

Answer 2

Another solution, in base R:

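# data.frame() keeps `result` logical; cbind() would coerce it to double,
# which makes any() emit a coercion warning
aggregate(result ~ id, 
          data.frame(id = codes.to.check$id, 
                     result = grepl(paste(icd10.autol.tr.regex, collapse="|"),
                                    codes.to.check$code)), 
          any)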

#>   id result
#> 1  1   TRUE
#> 2  2  FALSE
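
If collapsing the patterns into one string is undesirable (for example, to see later which pattern matched), the patterns can also be tested one at a time. The following is only a sketch in base R, with a shortened pattern vector and data frame; the names `hits`, `any_hit`, and `result` are made up for illustration and do not come from either answer.

```r
# Shortened example data, modelled on the question
patterns <- c("C90\\.00", "C81\\.\\d{1,2}", "D61\\.\\d{1,2}")
codes <- data.frame(code = c("E85.3", "C90.00", "Z45.20", "C79.3", "F05.9", "B99"),
                    id   = c(1, 1, 1, 2, 2, 2))

# Logical matrix: one row per code, one column per pattern
hits <- vapply(patterns, grepl, logical(nrow(codes)), x = codes$code)

# A code counts as a hit if any pattern matched it
any_hit <- rowSums(hits) > 0

# An id counts as a hit if any of its codes did
result <- tapply(any_hit, codes$id, any)
result
#     1     2 
#  TRUE FALSE 
```

The `hits` matrix keeps the per-pattern information, so `colnames(hits)[hits[2, ]]` would show which pattern matched the second code. The loop over patterns happens inside `vapply`, while `grepl` stays vectorised over the codes.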

Find whether at least one element in a group matches at least one pattern
