Execute dplyr operation only if column exists

Question

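Drawing on the discussion of conditional dplyr evaluation, I would like to conditionally execute a step in a pipeline depending on whether the referenced column exists in the data frame being passed.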

Example

The results produced by 1) and 2) should be identical.

Existing column

# 1)
mtcars %>%
  filter(am == 1) %>%
  filter(cyl == 4)

# 2)
mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(cyl == 4) else .
  }

Unavailable column

# 1)
mtcars %>%
  filter(am == 1)

# 2)
mtcars %>%
  filter(am == 1) %>%
  {
    if("absent_column" %in% names(.)) filter(absent_column == 4) else .
  }

Problem

For the existing column, the object passed does not correspond to the initial data frame. The original code returns the error message:

Error in filter(cyl == 4) : object 'cyl' not found

I have tried an alternative syntax (with no luck):

mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(.$cyl == 4) else .
  }
Error in UseMethod("filter_") :
  no applicable method for 'filter_' applied to an object of class "logical"

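Follow-up

I would like to extend this question to cover evaluation on the right-hand side of the == in the filter call. For instance, the syntax below attempts to filter on the first available value.

mtcars %>%
  filter({ if ("does_not_ex" %in% names(.)) does_not_ex else NULL } == { if ("does_not_ex" %in% names(.)) unique(.[['does_not_ex']]) else NULL })

As expected, the call fails with an error message:

Error in filter_impl(.data, quo) : Result must have length 32, not 0

When applied to an existing column:

mtcars %>%
  filter({ if ("mpg" %in% names(.)) mpg else NULL } == { if ("mpg" %in% names(.)) unique(.[['mpg']]) else NULL })

it works, with a warning message:

  mpg cyl disp  hp drat   wt  qsec vs am gear carb
1  21   6  160 110  3.9 2.62 16.46  0  1    4    4

Warning message:
In filter : longer object length is not a multiple of shorter object length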

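Follow-up question

Is there a neat way to extend the existing syntax so that the right-hand side of the filter call is evaluated conditionally, ideally staying within the dplyr workflow?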

Answer 1

Because of the way scoping works here, you cannot access the data frame from within the if statement. Fortunately, you don't need to.

Try:

mtcars %>%
  filter(am == 1) %>%
  filter({if("cyl" %in% names(.)) cyl else NULL} == 4)

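Here you can use the "." object in the condition to check whether the column exists and, if it does, return the column to the filter function.

Edit: as per docendo discimus' comment on the question, you can access the data frame, just not implicitly, i.e. you have to reference it explicitly with the dot (.).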

Answer 2

With across() in dplyr > 1.0.0 you can now use any_of when filtering. Compare with the original, where all columns are present:

mtcars %>% 
  filter(am == 1) %>% 
  filter(cyl == 4)

Removing cyl throws an error:

mtcars %>% 
  select(!cyl) %>% 
  filter(am == 1) %>% 
  filter(cyl == 4)

Using any_of (note that you have to write "cyl" and not cyl):

mtcars %>% 
  select(!cyl) %>% 
  filter(am == 1) %>% 
  filter(across(any_of("cyl"), ~.x == 4))
#N.B. this is equivalent to just filtering by `am == 1`.
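
A note for readers on newer dplyr versions: to the best of my knowledge, using across() inside filter() has since been deprecated in favor of if_all() and if_any(). The sketch below is a variation on the answer above, not part of it, and assumes a dplyr version that provides if_all(). With an empty selection, if_all() should keep every row, which matches the behavior described in the answer.

```r
library(dplyr)

# Column present: behaves like filter(cyl == 4)
with_cyl <- mtcars %>%
  filter(am == 1) %>%
  filter(if_all(any_of("cyl"), ~ .x == 4))

# Column absent: any_of() selects nothing, so no rows are dropped
without_cyl <- mtcars %>%
  select(!cyl) %>%
  filter(am == 1) %>%
  filter(if_all(any_of("cyl"), ~ .x == 4))

nrow(with_cyl)
nrow(without_cyl)
```

if_all() rather than if_any() is the assumption that matters here: an "all of nothing" condition is trivially satisfied, so the step turns into a no-op when the column is missing.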

Answer 3

Avoid this pitfall:

On a busy day, you might write something like the following:

library(dplyr)
df <- data.frame(A = 1:3, B = letters[1:3], stringsAsFactors = FALSE)
df %>% mutate(C = ifelse("D" %in% colnames(.), D, B))
# Note the values in column "C". No error is thrown, but the logic and the result are wrong.
  A B C
1 1 a a
2 2 b a
3 3 c a

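Why? Because "D" %in% colnames(.) returns a single TRUE or FALSE value, ifelse produces a single value as well. That value is then recycled across the whole column!

The correct way:

df %>% mutate(C = if ("D" %in% colnames(.)) D else B)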
  A B C
1 1 a a
2 2 b b
3 3 c c
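
The difference can be seen without dplyr at all. The following base R sketch uses made-up names (cols, B, wrong, right) purely for illustration: ifelse() returns a result as long as its test, while if/else returns the whole selected vector.

```r
cols <- c("A", "B")
B <- c("a", "b", "c")

# ifelse() is shaped like its test, which has length 1 here,
# so only the first element of B survives
wrong <- ifelse("D" %in% cols, NA_character_, B)

# if/else picks one branch and returns it unchanged
right <- if ("D" %in% cols) NA_character_ else B

length(wrong)
length(right)
```

Inside mutate(), the length-one result is then recycled to every row, which is how column C ends up filled with the same value.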

Answer 4

I know I'm late to the party, but here is an answer that is closer to what you originally had in mind:

mtcars %>%
  filter(am == 1) %>%
  {
    if("cyl" %in% names(.)) filter(., cyl == 4) else .
  }

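Basically, you were missing the . in filter. Note that this is because the pipe does not add . to filter(expr), since that call sits inside an expression surrounded by {}.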
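
If the same check is needed in several places, the braces can be folded into a small function. This is only a sketch: filter_if_present and its arguments are hypothetical names, not part of dplyr, and it assumes the data frame has no column called value.

```r
library(dplyr)

# Hypothetical helper: keep rows where `col == value`,
# but only when `col` actually exists in `df`
filter_if_present <- function(df, col, value) {
  if (col %in% names(df)) {
    filter(df, .data[[col]] == value)
  } else {
    df
  }
}

present <- mtcars %>% filter(am == 1) %>% filter_if_present("cyl", 4)
absent  <- mtcars %>% filter(am == 1) %>% filter_if_present("absent_column", 4)
```

Because the data frame is an ordinary function argument here, there is no question of whether the pipe inserted the dot.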

Answer 5

Edit: unfortunately, this was too good to be true.

I may be a bit late to the party, but is

mtcars %>% 
 filter(am == 1) %>%
 try(filter(absent_column== 4))

a solution?
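
One plausible reading of why this is too good to be true (an interpretation, not something the answer states): the pipe supplies the data frame as the first argument of try(), so filter(...) lands in the second argument, silent. That argument is only consulted when the first one raises an error, so the filter never runs, even for a column that does exist. A minimal check:

```r
library(dplyr)

# Equivalent to try(., silent = filter(cyl == 4)); the filter call is never evaluated
piped <- mtcars %>%
  filter(am == 1) %>%
  try(filter(cyl == 4))

# Compare against the pipeline without the try() step
baseline <- mtcars %>% filter(am == 1)
identical(piped, baseline)
```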

Answer 6

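This code does the trick and is quite flexible. ^ and $ are regular-expression anchors used to force an exact match.

library(dplyr)    # %>%
library(purrr)    # set_names()
library(stringr)  # str_replace()

# Patterns that match no column name leave the names unchanged
mtcars %>% 
  set_names(names(.) %>% 
              str_replace("am","1") %>% 
              str_replace("^cyl$","2") %>% 
              str_replace("Doesn't Exist","3")
              )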
