Question

How can I select columns using a regular expression that needs perl = TRUE?

data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% dplyr::select(matches("(?i)b(?!a)"))

Error in grep(needle, haystack, ...) : invalid regular expression '(?i)b(?!a)', reason 'Invalid regexp'

The regular expression itself is valid:

grep("(?i)b(?!a)",c("baa","boo","boa","lol","bAa"),perl=T)

> [1] 2 3
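
To illustrate the difference, here is a minimal base R sketch. It assumes the error comes from R's default regex engine (TRE), which rejects the lookahead `(?!a)`, while `perl = TRUE` switches to PCRE. The variable names are only for illustration.

```r
pattern <- "(?i)b(?!a)"
cols <- c("baa", "boo", "boa", "lol", "bAa")

# Default engine: the lookahead is not supported, so grep() signals an error.
# suppressWarnings() hides the accompanying TRE compilation warning.
default_result <- tryCatch(
  suppressWarnings(grep(pattern, cols)),
  error = function(e) conditionMessage(e)
)
print(default_result)

# PCRE engine: "b" not followed by "a", matched case-insensitively.
perl_result <- grep(pattern, cols, perl = TRUE, value = TRUE)
print(perl_result)
```

With `value = TRUE`, grep() returns the matching names rather than their positions, which makes it easier to check which columns would be selected.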

Is there a shortcut function or another way to do this?

Answer 1

matches in dplyr does not support perl = TRUE. However, you can write your own function. After some digging in the source code, this works:

The quick way:

library(dplyr)

#notice the 3 colons because grep_vars is not exported from dplyr
matches2 <- function (match, ignore.case = TRUE, vars = current_vars()) 
{
  dplyr:::grep_vars(match, vars, ignore.case = ignore.case, perl = TRUE)
}

data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% select(matches2("(?i)b(?!a)"))
#boo boa
#1   0   0

Or a more explicit solution:

matches2 <- function (match, ignore.case = TRUE, vars = current_vars()) 
{
  grep_vars2(match, vars, ignore.case = ignore.case)
}

#this is pretty much my only change in the original dplyr:::grep_vars
#to make it accept perl.
grep_vars2 <- function (needle, haystack, ...) 
{
  grep(needle, haystack, perl = TRUE, ...)
}

 data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% 
   select(matches2("(?i)b(?!a)"))
 #boo boa
 #1   0   0

Answer 2

Another way, although hacky and probably more dangerous than LyzandeR's suggestion:

body(matches)[[grep("grep_vars", body(matches))]] <- substitute(grep_vars(match, vars, ignore.case = ignore.case, perl=T))

data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% dplyr::select(matches("(?i)b(?!a)"))
  boo boa
1   0   0

I wouldn't hard-code body(matches)[[3]], because any update to the function could move that statement and break this small patch.
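
The same patching technique can be tried on a toy function first. This is only a sketch: `find_vars` is a hypothetical stand-in written for this example, not a dplyr function. It shows locating the statement by searching the deparsed body instead of relying on a fixed index.

```r
# Hypothetical helper that forwards to grep() without perl = TRUE.
find_vars <- function(match, vars, ignore.case = TRUE) {
  grep(match, vars, ignore.case = ignore.case)
}

# as.character() on the body gives one string per statement ("{" comes first),
# so the position of the grep() call can be looked up instead of hard-coded.
idx <- grep("grep", as.character(body(find_vars)))

# Replace that statement with a version that passes perl = TRUE.
body(find_vars)[[idx]] <- quote(
  grep(match, vars, ignore.case = ignore.case, perl = TRUE)
)

patched <- find_vars("b(?!a)", c("baa", "boo", "boa", "lol", "bAa"))
print(patched)
```

The lookup still breaks if the function is rewritten so that no statement mentions grep, so the patch remains fragile. It is just less fragile than a fixed position.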

Answer 3

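As an amendment/side note to LyzandeR's answer, here is a version that does not use dplyr vocabulary, only the magrittr pipe. This way you can skip writing wrapper functions, specifying arguments, and so on.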

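It is more verbose than dplyr, but more concise than plain base R, and it gives you the full flexibility of using any function, such as grep or stringi::stri_detect.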

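It is also significantly faster on this example; see the benchmark below. Note, however, that dplyr's overhead dominates for such a tiny data frame, so speed should be re-checked on larger data. A fair comparison depends on the use case.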

df <- data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0)

library(magrittr)
df %>% 
.[,grep("(?i)b(?!a)", names(.), perl = T)]
#    boo boa
# 1   0   0

# Below: copies of LyzandeR's approaches, for the benchmark
library(dplyr)
matches2 <- function (match, ignore.case = TRUE, vars = current_vars()) {
                      dplyr:::grep_vars(match, vars, ignore.case = ignore.case, perl = TRUE)
                      }

grep_vars2 <- function (needle, haystack, ...) {
                        grep(needle, haystack, perl = TRUE, ...)
                        }

matches3 <- function (match, ignore.case = TRUE, vars = current_vars()) {
                      grep_vars2(match, vars, ignore.case = ignore.case)
                      }

library(microbenchmark)
microbenchmark(
  df %>% select(matches2("(?i)b(?!a)")),
  df %>% select(matches3("(?i)b(?!a)")),
  df %>% .[,grep("(?i)b(?!a)", names(.), perl = T)]
)

# Unit: microseconds
# expr                                                     min       lq      mean    median        uq       max neval
# df %>% select(matches2("(?i)b(?!a)"))               3994.867 4309.877 4570.6414 4555.8065 4726.9310  6618.769   100
# df %>% select(matches3("(?i)b(?!a)"))               3981.841 4177.834 4792.2025 4396.3275 4655.6780 31812.876   100
# df %>% .[, grep("(?i)b(?!a)", names(.), perl = T)]   183.164  210.797  242.1678  237.2455  263.6935   554.624   100
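
One caveat with the bracket approach is worth sketching: when exactly one column matches, `df[, idx]` drops to a plain vector unless `drop = FALSE` is given. The data frame below is made up for illustration and uses base R only.

```r
# Only "boo" matches here, so the selection has a single column.
df_one <- data.frame(baa = 0, boo = 0, lol = 0)
idx <- grep("(?i)b(?!a)", names(df_one), perl = TRUE)

# Without drop = FALSE the result is a bare numeric vector.
as_vector <- df_one[, idx]
print(class(as_vector))

# With drop = FALSE the result stays a data frame, like dplyr::select() returns.
one_col <- df_one[, idx, drop = FALSE]
print(names(one_col))
```

Adding `drop = FALSE` keeps the result type stable regardless of how many columns match, which matters if later steps in the pipe expect a data frame.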

Using a perl = TRUE regular expression in dplyr select
