Question

I have some R code that looks like this:

library(dplyr)
library(datasets)

iris %.% group_by(Species) %.% filter(rank(Petal.Length, ties.method = 'random')<=2) %.% ungroup()

which gives:

Source: local data frame [6 x 5]

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          4.3         3.0          1.1         0.1     setosa
2          4.6         3.6          1.0         0.2     setosa
3          5.0         2.3          3.3         1.0 versicolor
4          5.1         2.5          3.0         1.1 versicolor
5          4.9         2.5          4.5         1.7  virginica
6          6.0         3.0          4.8         1.8  virginica

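That is, it groups by species and, for each group, keeps only the two rows with the shortest Petal.Length. I have some duplication in my code, because I do this several times for different columns and numbers. E.g.: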

iris %.% group_by(Species) %.% filter(rank(Petal.Length, ties.method = 'random')<=2) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(-Petal.Length, ties.method = 'random')<=2) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(Petal.Width, ties.method = 'random')<=3) %.% ungroup()
iris %.% group_by(Species) %.% filter(rank(-Petal.Width, ties.method = 'random')<=3) %.% ungroup()

I'd like to extract this into a function. The naive approach doesn't work:

keep_min_n_by_species <- function(expr, n) {
  iris %.% group_by(Species) %.% filter(rank(expr, ties.method = 'random') <= n) %.% ungroup()
}

keep_min_n_by_species(Petal.Width, 2)

Error in filter_impl(.data, dots(...), environment()) : 
  object 'Petal.Width' not found

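As I understand it, the expression rank(Petal.Length, ties.method = 'random') <= 2 is evaluated in a different context introduced by the filter function, and that context is what gives Petal.Length its meaning. I can't just swap a variable in for Petal.Length, because it would be evaluated in the wrong context. I've tried different combinations of substitute and eval, having read this page: Non-standard evaluation. I can't find a combination that works. I think the problem may be that I don't want to pass the caller's expression (Petal.Length) through to filter for it to evaluate. Instead, I want to construct a new, bigger expression (rank(Petal.Length, ties.method = 'random') <= 2) and then pass that whole expression to filter for it to evaluate.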

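How do I refactor this expression into a function?
More generally, how should I go about extracting an R expression into a function?
More generally still, am I approaching this with the wrong mindset? In the more mainstream languages I'm familiar with (e.g. Python, C++, C#), this is a relatively straightforward operation that I want to do all the time to remove duplication in my code. In R, it seems (to me, at least) that non-standard evaluation can make it a very non-obvious operation. Should I be doing something else entirely?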

Answer 1

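dplyr version 0.3 is starting to address this using the lazyeval package, as @baptiste mentioned, along with a new family of functions that use standard evaluation (same function names as the NSE versions, but ending in _). There's a vignette here: https://github.com/hadley/dplyr/blob/master/vignettes/nse.Rmd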

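All that being said, I don't know the best practice for what you're trying to do (though I'm trying to do the same thing). I have something that works, but like I said, I don't know if it's the best way to do it. Note the use of filter_() instead of filter(), and that the argument is passed in as a quoted character string: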

devtools::install_github("hadley/dplyr")
devtools::install_github("hadley/lazyeval")

library(dplyr)
library(lazyeval)

keep_min_n_by_species <- function(expr, n, rev = FALSE) {
  iris %>% 
    group_by(Species) %>% 
    filter_(interp(~rank(if (rev) -x else x, ties.method = 'random') <= y, # filter_, not filter
                   x = as.name(expr), y = n)) %>% 
    ungroup()
}

keep_min_n_by_species("Petal.Width", 3) # "Petal.Width" as character string
keep_min_n_by_species("Petal.Width", 3, rev = TRUE)

Update based on @hadley's comment:

keep_min_n_by_species <- function(expr, n) {
  expr <- lazy(expr)

  formula <- interp(~rank(x, ties.method = 'random') <= y,
                    x = expr, y = n)

  iris %>% 
    group_by(Species) %>% 
    filter_(formula) %>% 
    ungroup()
}

keep_min_n_by_species(Petal.Width, 3)
keep_min_n_by_species(-Petal.Width, 3)
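For readers on a much newer dplyr than the one discussed above: the underscore verbs such as filter_() were later deprecated in favour of tidy evaluation. The following is only a sketch of how the same function might look there, assuming dplyr 1.0 or later with an rlang version that supports the `{{ }}` (embrace) operator; it is not part of the original answer and has not been checked against the 0.3-era API.

```r
library(dplyr)

# Sketch under the assumption of dplyr >= 1.0 (tidy evaluation).
# `{{ expr }}` forwards the caller's unevaluated expression into filter(),
# so it is evaluated against the columns of each group.
keep_min_n_by_species <- function(expr, n) {
  iris %>%
    group_by(Species) %>%
    filter(rank({{ expr }}, ties.method = "random") <= n) %>%
    ungroup()
}

smallest_width <- keep_min_n_by_species(Petal.Width, 2)    # 2 rows per species
largest_length <- keep_min_n_by_species(-Petal.Length, 3)  # 3 rows per species
```

Because ties.method = "random" always yields distinct ranks, each group contributes exactly n rows, although which tied rows are kept can vary between runs.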

Answer 2

How about

keep_min_n_by_species <- function(expr, n) {
    mc <- match.call()
    fx <- bquote(rank(.(mc$expr), ties.method = 'random') <= .(mc$n))
    iris %.% group_by(Species) %.% filter(fx) %.% ungroup()
}

This seems to allow all of your statements to run without error:

keep_min_n_by_species(Petal.Width, 2)
keep_min_n_by_species(-Petal.Width, 2)
keep_min_n_by_species(Petal.Width, 3)
keep_min_n_by_species(-Petal.Width, 3)

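The idea is to use match.call() to capture the unevaluated expression passed to the function. Then we use bquote() to build the filter condition as a call object.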
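To see what match.call() and bquote() produce on their own, here is a small base R sketch with no dplyr involved. The helper name build_filter is made up for illustration, and evaluating the call with eval() against the whole iris data frame (rather than per group) is a simplification of what filter does.

```r
# Hypothetical helper: returns the condition as an unevaluated call object.
build_filter <- function(expr, n) {
  mc <- match.call()  # captures the arguments exactly as the caller wrote them
  # .( ) splices the captured pieces into the quoted template
  bquote(rank(.(mc$expr), ties.method = "random") <= .(mc$n))
}

fx <- build_filter(-Petal.Width, 2)
print(fx)  # shows the assembled call; nothing has been evaluated yet

# Evaluate the call with the columns of iris in scope (ungrouped, for simplicity)
keep <- eval(fx, iris)
widest <- iris[keep, ]  # the two rows with the largest Petal.Width overall
```

The point of the sketch is that the function never evaluates Petal.Width itself. It only assembles a larger expression, and evaluation is deferred until a data frame is supplied as the context.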

Refactoring R code when library functions use non-standard evaluation
