Question

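I have been reading about SE and NSE in dplyr and ran into a problem where I actually need SE. I have the following function, which should find rows where some items match but the target variable does not: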

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  inconsists <- df %>% 
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(get(target_column)))) %>% 
    filter(uTargets > 1)
}

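This seems to work in my case. However, get(target_column) is a workaround, since I need SE for the variable and cannot hard-code the column name. I first tried the SE versions (summarise_(.dots = ...)), but could not find the right syntax for evaluating target_column.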

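My question is: are there any downsides to simply using get()? Is it a no-go? Any risks or slowdowns? Simply using get is certainly more readable than the "proper" SE syntax.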

Answer 1

This can be done with rlang.

Suppose your use case is:

find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# # A tibble: 8 x 6
# # Groups:   cyl, vs, am, gear [5]
#     cyl    vs    am  gear  carb uTargets
#   <dbl> <dbl> <dbl> <dbl> <dbl>    <int>
# 1  4.00  1.00  0     4.00  2.00        2
# 2  4.00  1.00  1.00  4.00  1.00        4
# 3  4.00  1.00  1.00  4.00  2.00        2
# 4  6.00  1.00  0     3.00  1.00        2
# 5  6.00  1.00  0     4.00  4.00        2
# 6  8.00  0     0     3.00  2.00        4
# 7  8.00  0     0     3.00  3.00        3
# 8  8.00  0     0     3.00  4.00        4

You could do:

library(dplyr)

f2 <- function(df, target_column, cols_to_use) {
  group_by_at(df, cols_to_use) %>% 
    summarise(uTargets = n_distinct(!! rlang::sym(target_column))) %>% 
    filter(uTargets > 1)
}

all.equal(
  find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb")),
  f2(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
)
# [1] TRUE

An actual answer to the question about risks:

Now suppose you have foo <- 3 in your global environment. Compare:

find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
# A tibble: 0 x 6
# Groups:   cyl, vs, am, gear [0]
# ... with 6 variables: cyl <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
#   carb <dbl>, uTargets <int>

which silently returns an empty data frame, with:

f2(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
# Error in summarise_impl(.data, dots) : variable 'foo' not found

which raises an error that points directly at the mistake.
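The silent result is consistent with how base R's get() looks names up: by default (inherits = TRUE) it keeps searching the enclosing environments when the name is not found in the first one. The following is a minimal base-R sketch of that behaviour, not dplyr's actual internals; `mask` is a hypothetical stand-in for the environment that holds a data frame's columns.

```r
# Hypothetical stand-ins: `foo` plays the stray global variable, `mask`
# plays the environment that holds the columns of a data frame.
foo <- 3
mask <- list2env(list(mpg = c(21, 22.8, 21.4)), parent = environment())

# Default (inherits = TRUE): "foo" is not in `mask`, so the lookup walks up
# to the enclosing environment and quietly finds the unrelated variable.
found_by_default <- get("foo", envir = mask)

# inherits = FALSE restricts the lookup to `mask` itself, so it fails loudly.
strict_lookup <- tryCatch(
  get("foo", envir = mask, inherits = FALSE),
  error = function(e) conditionMessage(e)
)

print(found_by_default)
print(strict_lookup)
```

In the grouped summary, a constant such as foo has exactly one distinct value in every group, which is why the filter uTargets > 1 leaves nothing behind.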

Edit

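Since you seem to be after the "tidy way", I would recommend the following. The underlying philosophy seems to be to discourage passing variable names as strings as much as possible, in favor of bare names: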

f3 <- function(df, target_column, ...) {
  target_column <- enquo(target_column)
  cols_to_use <- quos(...)
  group_by(df, !!! cols_to_use) %>% 
    summarise(uTargets = n_distinct(!! target_column)) %>% 
    filter(uTargets > 1)
}
all.equal(
  find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb")),
  f3(mtcars, target_column = mpg, cyl, vs, am, gear, carb)
)
# [1] TRUE

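The interface of f3() is also designed to resemble that of other tidyverse functions, so it may integrate better into a tidy pipeline of transformations.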
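With more recent releases (assuming rlang 0.4.0 or later and a matching dplyr), the enquo()/!! pair can be written with the embrace operator {{ }}, and the dots can be passed straight to group_by(). A sketch under that assumption follows; the name f4 is hypothetical and only used here for illustration.

```r
library(dplyr)

# f4 is a hypothetical name; it mirrors f3() but uses the embrace operator.
f4 <- function(df, target_column, ...) {
  df %>%
    group_by(...) %>%                                    # bare grouping columns
    summarise(uTargets = n_distinct({{ target_column }})) %>%
    filter(uTargets > 1)
}

res <- f4(mtcars, target_column = mpg, cyl, vs, am, gear, carb)
print(res)
```

The calling convention is the same as for f3(): the target column and the grouping columns are all given as bare names.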

Answer 2

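@Aurele has already shown how to do this with rlang, but I thought it would be interesting to see whether we can make it work with get. As pointed out, the first few attempts with get did not work, but after some experimentation the following seem to work. That is not to say I recommend doing it this way; it is just for the sake of interest.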

1. get/do

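If we wrap the summarise statement in do, then we can use get(..., .) like this and it will work as expected. This is probably the simplest and most direct way to use get with group by. The key observation is that inside do the dot refers to just those rows in the current group, whereas outside do it refers to all rows of the input when used in the actual arguments of a nested function call.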

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    group_by_at(cols_to_use) %>% 
    do(summarise(., uTargets = length(unique(get(target_column, .))))) %>% 
    filter(uTargets > 1)
}

# gives desired result
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# ... snip correct output ...

# correctly gives an error indicating it can't find `foo`
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
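One way to see why passing the data frame as the second argument of get() produces an error here: base R converts a list (and therefore a data frame) given in that position with as.environment(), and the resulting environment has the empty environment as its parent, so the search cannot reach the global foo. A small base-R sketch, independent of dplyr:

```r
foo <- 3  # stray global variable, as in the example above

# A real column is found inside the data frame.
mpg_values <- get("mpg", mtcars)

# "foo" is not a column; the lookup ends at the empty environment
# instead of reaching the global `foo`, so get() signals an error.
foo_lookup <- tryCatch(
  get("foo", mtcars),
  error = function(e) conditionMessage(e)
)

print(head(mpg_values))
print(foo_lookup)
```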

2. Going to the grandparent with inherits = FALSE

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(get(target_column,
       parent.env(parent.env(environment())), inherits = FALSE)))) %>% 
    filter(uTargets > 1)
}

# gives desired result
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))
# ... snip correct output ...

# correctly gives an error indicating it can't find `foo`
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))
## Error in summarise_impl(.data, dots) : 
##   Evaluation error: object 'foo' not found.

To streamline this solution somewhat, we could define GET like this:

GET <- function(x) {
  p <- parent.frame()
  p3 <- parent.env(parent.env(p))
  get(x, p3, inherits = FALSE)
}

find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(GET(target_column)))) %>% 
    filter(uTargets > 1)
}

# gives expected answer    
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))

# gives expected error
foo <- 3
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))

3. Subsetting by a key column

Another possibility is to subset by a key column. mtcars does not have such a column, but if we put the row names into a column then we will have one:

library(tibble)  # provides rownames_to_column()
find_dataset_inconsistencies <- function(df, target_column, cols_to_use) {
  df %>% 
    rownames_to_column %>%
    group_by_at(cols_to_use) %>% 
    summarise(uTargets = length(unique(
        get(target_column, .[.$rowname %in% rowname, ])))) %>% 
    filter(uTargets > 1)
}

# gives expected answer
find_dataset_inconsistencies(mtcars, "mpg", c("cyl", "vs", "am", "gear", "carb"))

# gives expected error
find_dataset_inconsistencies(mtcars, "foo", c("cyl", "vs", "am", "gear", "carb"))

Is there a downside to using get() in dplyr instead of SE?
