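I have been reading about SE and NSE in dplyr and have run into a problem where I actually need SE. I have the following function, which should find rows where some items match but the target variable does not: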
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  inconsists <- df %>%
    group_by_at(cols_to_use) %>%
    summarise(uTargets = length(unique(get(target_column)))) %>%
    filter(uTargets > 1)
}
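This seems to work for my case. However, get(target_column) is a workaround: I need SE for the variable and cannot hard-code the column name. I initially tried the SE version (summarise_(.dots = ...)), but could not find the right syntax for evaluating target_column.
My question is: are there any drawbacks to simply using get()? Is it a no-go? Any risks or slowdowns? Simply using get certainly looks more readable than the "proper" SE syntax.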
Answer 0 (score: 5)
You can do this with NSE using rlang.
Suppose your use case is:
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# # A tibble: 8 x 6
# # Groups: cyl, vs, am, gear [5]
# cyl vs am gear carb uTargets
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 4.00 1.00 0 4.00 2.00 2
# 2 4.00 1.00 1.00 4.00 1.00 4
# 3 4.00 1.00 1.00 4.00 2.00 2
# 4 6.00 1.00 0 3.00 1.00 2
# 5 6.00 1.00 0 4.00 4.00 2
# 6 8.00 0 0 3.00 2.00 4
# 7 8.00 0 0 3.00 3.00 3
# 8 8.00 0 0 3.00 4.00 4
You could do:
library(dplyr)
f2 <- function(df, target_column, cols_to_use) {
  group_by_at(df, cols_to_use) %>%
    summarise(uTargets = n_distinct(!! rlang::sym(target_column))) %>%
    filter(uTargets > 1)
}
all.equal(
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb")),
f2(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
)
# [1] TRUE
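If a more recent dplyr (1.0 or later) is available, the .data pronoun is another way to refer to a column named by a string. The following is only a sketch under that assumption: f_data is a made-up name, and group_by(across(all_of(...))) stands in for group_by_at().

```r
library(dplyr)

# Hypothetical variant of f2(): the column named by the string is looked up
# in the data only, via the .data pronoun (assumes dplyr >= 1.0).
f_data <- function(df, target_column, cols_to_use) {
  df %>%
    group_by(across(all_of(cols_to_use))) %>%
    summarise(uTargets = n_distinct(.data[[target_column]]), .groups = "drop") %>%
    filter(uTargets > 1)
}

res <- f_data(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))

# A name that is not a column should raise an error rather than being
# resolved from somewhere outside the data.
bad <- tryCatch(
  f_data(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb")),
  error = function(e) "error"
)
```

Because .data only ever looks inside the data frame, a mistyped column name cannot be satisfied by a variable of the same name elsewhere.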
Now for the actual answer to the question about risks:
Suppose you have foo <- 3 in the global environment. Compare:
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
# A tibble: 0 x 6
# Groups: cyl, vs, am, gear [0]
# ... with 6 variables: cyl <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
# carb <dbl>, uTargets <int>
which silently returns an empty data frame, with:
f2(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
# Error in summarise_impl(.data, dots) : variable 'foo' not found
which throws an error that points directly at the mistake.
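The silent result follows from how get() resolves names: by default (inherits = TRUE) it keeps searching enclosing environments when the name is not found in the first one. A stray foo is therefore picked up, every group ends up with exactly one unique value, and the filter drops all rows. Below is a minimal base-R sketch of that lookup rule. The environments outer_env and data_env are made up for illustration and are not dplyr's actual internals.

```r
# Stand-in for an enclosing environment that happens to contain `foo`.
outer_env <- new.env()
assign("foo", 3, envir = outer_env)

# Stand-in for the data: the columns of mtcars, enclosed by outer_env.
data_env <- list2env(as.list(mtcars), parent = outer_env)

# Default lookup (inherits = TRUE) falls through to outer_env.
loose <- get("foo", envir = data_env)

# Restricting the search to the data itself turns the typo into an error.
strict <- tryCatch(
  get("foo", envir = data_env, inherits = FALSE),
  error = function(e) "not found"
)
```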
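Edit
Since you seem to be after the "tidy way", I would recommend the following. The underlying philosophy seems to be to discourage, as far as possible, passing variable names as strings rather than as bare names: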
f3 <- function(df, target_column, ...) {
  target_column <- enquo(target_column)
  cols_to_use <- quos(...)
  group_by(df, !!! cols_to_use) %>%
    summarise(uTargets = n_distinct(!! target_column)) %>%
    filter(uTargets > 1)
}
all.equal(
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb")),
f3(mtcars, target_column = mpg, cyl, vs, am, gear, carb)
)
# [1] TRUE
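The interface of f3() is also designed to resemble that of other tidyverse functions, and it may integrate better into a tidy pipeline of transformations.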
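With rlang 0.4 or later, the same bare-name interface can be written more compactly: the embracing operator {{ }} replaces the enquo() / !! pair, and the dots can be forwarded to group_by() directly. This is a sketch under that version assumption, and f_embrace is a made-up name.

```r
library(dplyr)

# Hypothetical compact variant of f3(); assumes rlang >= 0.4 and dplyr >= 1.0.
f_embrace <- function(df, target_column, ...) {
  df %>%
    group_by(...) %>%                       # grouping columns as bare names
    summarise(uTargets = n_distinct({{ target_column }}), .groups = "drop") %>%
    filter(uTargets > 1)
}

res_embrace <- f_embrace(mtcars, mpg, cyl, vs, am, gear, carb)
```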
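Answer 1 (score: 2)
@Aurele has already shown how to do this with rlang, but I thought it would be interesting to see whether we can get it to work using get. As I noted, my first few attempts with get did not work, but after some experimentation this seems to work. I am not suggesting that it be done this way; it is just for the sake of interest.
If we wrap the summarise statement in do, then we can use get(..., .) like this and it will work as expected. This is probably the simplest and most straightforward way to use get with group_by. The key observation is that inside do, the dot refers to only those rows in the current group, whereas outside do it refers to all rows of the input when used in an actual argument of nested function calls.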
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>%
    group_by_at(cols_to_use) %>%
    do(summarise(., uTargets = length(unique(get(target_column, .))))) %>%
    filter(uTargets > 1)
}
# gives desired result
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# ... snip correct output ...
# correctly gives an error indicating it can't find `foo`
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
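One plausible reading of why this version reports the missing column: when get() receives a data frame as its second argument, R converts it with as.environment(), and an environment built from a list has the empty environment as its parent, so there is nowhere else to search. A base-R sketch with no dplyr involved (found and not_found are illustrative names):

```r
foo <- 3  # a variable outside the data, as in the example above

# A name that is a column is found in the data frame itself.
found <- get("mpg", mtcars)

# A name that is not a column is an error even though `foo` exists outside,
# because the environment made from the data frame has no enclosing scope.
not_found <- tryCatch(get("foo", mtcars), error = function(e) conditionMessage(e))
```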
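Alternatively, without do, we can tell get exactly where to look: the grandparent of the environment in which the summarise expression is evaluated, with inherits = FALSE so that the search cannot continue into enclosing environments:
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>%
    group_by_at(cols_to_use) %>%
    summarise(uTargets = length(unique(get(target_column,
      parent.env(parent.env(environment())), inherits = FALSE)))) %>%
    filter(uTargets > 1)
}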
# gives desired result
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# ... snip correct output ...
# correctly gives an error indicating it can't find `foo`
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
## Error in summarise_impl(.data, dots) :
## Evaluation error: object 'foo' not found.
To streamline this solution, we can define GET like this:
GET <- function(x) {
  p <- parent.frame()
  p3 <- parent.env(parent.env(p))
  get(x, p3, inherits = FALSE)
}
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>%
    group_by_at(cols_to_use) %>%
    summarise(uTargets = length(unique(GET(target_column)))) %>%
    filter(uTargets > 1)
}
# gives expected answer
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# gives expected error
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
Another possibility is to subset by a key column. mtcars has no such column, but if we put the row names into a column, then we have one:
library(tibble)  # rownames_to_column() comes from tibble
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>%
    rownames_to_column %>%
    group_by_at(cols_to_use) %>%
    summarise(uTargets = length(unique(
      get(target_column, .[.$rowname %in% rowname, ])))) %>%
    filter(uTargets > 1)
}
# gives expected answer
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# gives expected error
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))