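Problem using dplyr::filter when creating an R function

Time: 2018-09-28 17:50:26

Tags: r filter dplyr rlang tidyeval

I have looked at other answers but could not find a solution that makes the code below work. Basically, I am creating a function that runs inner_join on two data frames and then filters on a column passed in as a function argument.

The problem is that the filter part of the function does not work. However, it does work if I take the filter out of the function and append it afterwards, like mydiff("a") %>% filter(a.x != a.y).

Any suggestions would be helpful.

Note that I pass the function input as a quoted string.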

library(dplyr)

# fake data
df1<- tibble(id = seq(4,19,2), 
             a = c("a","b","c","d","e","f","g","h"), 
             b = c(rep("foo",3), rep("bar",5)))
df2<- tibble(id = seq(10, 20, 1), 
             a = c("d","a", "e","f","k","m","g","i","h", "a", "b"),
             b = c(rep("bar", 7), rep("foo",4)))

# What I am trying to do
dplyr::inner_join(df1, df2, by = "id") %>% select(id, b.x, b.y) %>% filter(b.x!=b.y)

#> # A tibble: 1 x 3
#>      id b.x   b.y  
#>   <dbl> <chr> <chr>
#> 1    18 bar   foo

# creating a function so that I can filter by difference in column if I have more columns
mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 = paste0(quo_name(filteron), "x")
  col_2 = paste0(quo_name(filteron), "y")
  my_df<- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))
  my_df %>% select(id, col_1, col_2) %>% filter(col_1 != col_2)
}

# the filter part is not working as expected. 
# There is no difference whether i pipe filter or leave it out
mydiff("a")

#> # A tibble: 5 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    10 d     d    
#> 2    12 e     e    
#> 3    14 f     k    
#> 4    16 g     g    
#> 5    18 h     h

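5 answers:

Answer 0 (score: 5)

From https://dplyr.tidyverse.org/articles/programming.html:

Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that means they don't follow the usual R rules of evaluation.

This can sometimes cause problems when you try to wrap them in functions. Here is a base R version of the function you created.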

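mydiff <- function(filteron, df_1 = df1, df_2 = df2) {
  # Build the suffixed column names, e.g. "ax" and "ay"
  col_1 <- paste0(filteron, "x")
  col_2 <- paste0(filteron, "y")

  # Join on id, using the function arguments rather than the global df1/df2
  my_df <- merge(df_1, df_2, by = "id", suffixes = c("x", "y"))

  # Keep only the rows where the two columns differ
  my_df[my_df[, col_1] != my_df[, col_2], c("id", col_1, col_2)]
}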

> mydiff("a")
  id ax ay
3 14  f  k
> mydiff("b")
  id  bx  by
5 18 bar foo

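This will solve your problem and will probably keep working as is, now and in the future. By depending less on external packages you reduce this kind of issue, along with other quirks that may appear as package authors evolve their work.

Answer 1 (score: 1)

This looks like an evaluation problem to me. Try this modified mydiff function, which uses the lazyeval package: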

mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 <- paste0(quo_name(filteron), "x")
  col_2 <- paste0(quo_name(filteron), "y")
  criteria <- lazyeval::interp(~ x != y, .values = list(x = as.name(col_1), y = as.name(col_2)))
  my_df <- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))
  my_df %>% select(id, col_1, col_2) %>% filter_(criteria)
}

You can look at the Functions chapter of Hadley Wickham's book Advanced R to learn more.

Answer 2 (score: 1)


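The reason it did not work in your original function is that col_1 is a string, whereas dplyr::filter() expects "unquoted" input variables on the LHS. So you first need to convert col_1 to a variable using sym(), and then unquote it inside filter using !! (bang bang).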
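A minimal sketch of that failure mode, using a small made-up tibble (`toy` is hypothetical, not the asker's data). Inside `filter()`, `col_1` and `col_2` are found as ordinary variables holding strings, so the condition compares the two names rather than the two columns:

```r
library(dplyr)
library(rlang)

# Hypothetical toy data, for illustration only
toy <- tibble(id = 1:3, ax = c("d", "f", "g"), ay = c("d", "k", "g"))

col_1 <- "ax"
col_2 <- "ay"

# col_1 and col_2 are not columns of `toy`, so filter() picks up the strings
# from the calling environment and evaluates "ax" != "ay": a single TRUE that
# is recycled across all rows, so no row is dropped.
kept_all <- toy %>% filter(col_1 != col_2)
nrow(kept_all)

# Turning the strings into symbols and unquoting them makes filter()
# compare the columns themselves.
kept_diff <- toy %>% filter(!!sym(col_1) != !!sym(col_2))
kept_diff$id
```

This is why piping `filter()` inside the original function appeared to make no difference to the result.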

rlang has a very nice function, qq_show, for showing what actually happens during quoting and unquoting (see the output below).

See also this similar question.

library(rlang)
library(dplyr)

# creating a function that can take either string or symbol as input
mydiff <- function(filteron, df_1 = df1, df_2 = df2) {

  col_1 <- paste0(quo_name(enquo(filteron)), "x")
  col_2 <- paste0(quo_name(enquo(filteron)), "y")

  my_df <- inner_join(df_1, df_2, by = "id", suffix = c("x", "y"))

  cat('\nwithout sym and unquote\n')
  qq_show(col_1 != col_2)

  cat('\nwith sym and unquote\n')
  qq_show(!!sym(col_1) != !!sym(col_2))
  cat('\n')

  my_df %>% 
    select(id, col_1, col_2) %>% 
    filter(!!sym(col_1) != !!sym(col_2))
}

### testing: filteron as a string
mydiff("a")
#> 
#> without sym and unquote
#> col_1 != col_2
#> 
#> with sym and unquote
#> ax != ay
#> 
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k

### testing: filteron as a symbol
mydiff(a)
#> 
#> without sym and unquote
#> col_1 != col_2
#> 
#> with sym and unquote
#> ax != ay
#>  
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k

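Created on 2018-09-28 by the reprex package (v0.2.1.9000)

Answer 3 (score: 1)

The suggestion to use base R for simple functions is a good one, but it does not scale to more complex tidyverse functions, and you lose portability to dplyr backends such as databases. If you want to write functions around tidyverse pipelines, you have to learn a little about R expressions and the unquoting operator !!. I suggest skimming the first sections of https://tidyeval.tidyverse.org to get a rough idea of the concepts used here.

Since the function you want to create takes a bare column name and does not involve complex expressions (such as those you would pass to mutate() or summarise()), we don't need anything as fancy as quosures. We can work with symbols. To create a symbol, use as.name() or rlang::sym():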

as.name("mycolumn")
#> mycolumn

rlang::sym("mycolumn")
#> mycolumn

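The latter has the advantage of being part of a larger family of functions: ensym(), and the plural variants syms() and ensyms(). We will use ensym() to capture a column name, i.e. to delay the execution of the column so that it can be passed to dplyr after a few transformations. Delaying execution is called "quoting".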
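As a small sketch of what capturing a column name looks like in isolation (`col_name` is a hypothetical helper written for this illustration, not part of rlang):

```r
library(rlang)

# Hypothetical helper: capture a column name and return it as a string
col_name <- function(var) {
  var <- ensym(var)   # quote the argument instead of evaluating it
  as_string(var)      # convert the symbol to a string
}

col_name(b)     # bare column name
col_name("b")   # a string is accepted too and gives the same result

# syms() is the plural variant: one symbol per element of a character vector
cols <- syms(c("b.x", "b.y"))
```

Because ensym() accepts both a bare name and a string, a function built on it can be called either way.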

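I made a few changes to the interface of your function:

  • Take the data frames first, for consistency with dplyr functions.

  • Do not provide defaults for the data frames. Those defaults make too many assumptions.

  • Make by and suffix user-configurable, with reasonable defaults.

Here is the code, with explanations inline: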

mydiff <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(is.character(suffix), length(suffix) == 2)

  # Let's start by the easy task, joining the data frames
  df <- dplyr::inner_join(df1, df2, by = by, suffix = suffix)

  # Now onto dealing with the diff variable. `ensym()` takes a column
  # name and delays its execution:
  var <- rlang::ensym(var)

  # A delayed column name is not a string, it's a symbol. So we need
  # to transform it to a string in order to work with paste() etc.
  # `quo_name()` works in this case but is generally only for
  # providing default names.
  #
  # Better use base::as.character() or rlang::as_string() (the latter
  # works a bit better on Windows with foreign UTF-8 characters):
  var_string <- rlang::as_string(var)

  # Now let's add the suffix to the name:
  col1_string <- paste0(var_string, suffix[[1]])
  col2_string <- paste0(var_string, suffix[[2]])

  # dplyr::select() supports column names as strings but it is an
  # exception in the dplyr API. Generally, dplyr functions take bare
  # column names, i.e. symbols. So let's transform the strings back to
  # symbols:
  col1 <- rlang::sym(col1_string)
  col2 <- rlang::sym(col2_string)

  # The delayed column names now need to be inserted back into the
  # dplyr code. This is accomplished by unquoting with the !!
  # operator:
  df %>%
    dplyr::select(id, !!col1, !!col2) %>%
    dplyr::filter(!!col1 != !!col2)
}

mydiff(df1, df2, b)
#> # A tibble: 1 x 3
#>      id b.x   b.y
#>   <dbl> <chr> <chr>
#> 1    18 bar   foo

mydiff(df1, df2, "a")
#> # A tibble: 1 x 3
#>      id a.x   a.y
#>   <dbl> <chr> <chr>
#> 1    14 f     k

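You can also simplify the function by taking strings instead of bare column names. In this version I use syms() to create a list of symbols, and !!! to pass them all to select() at once: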

mydiff2 <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(
    is.character(suffix), length(suffix) == 2,
    is.character(var), length(var) == 1
  )

  # Create a list of symbols from a character vector:
  cols <- rlang::syms(paste0(var, suffix))

  df <- dplyr::inner_join(df1, df2, by = by, suffix = suffix)

  # Unquote the whole list at once with the big bang !!!
  df %>%
    dplyr::select(id, !!!cols) %>%
    dplyr::filter(!!cols[[1]] != !!cols[[2]])
}

mydiff2(df1, df2, "a")
#> # A tibble: 1 x 3
#>      id a.x   a.y
#>   <dbl> <chr> <chr>
#> 1    14 f     k
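For string inputs, an alternative that avoids building symbols altogether is the `.data` pronoun together with `all_of()`. The sketch below is not from the answer above: `mydiff3`, `left`, and `right` are hypothetical names, and it assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical variant of mydiff2 that uses the .data pronoun instead of !!
mydiff3 <- function(df1, df2, var, by = "id", suffix = c(".x", ".y")) {
  stopifnot(
    is.character(suffix), length(suffix) == 2,
    is.character(var), length(var) == 1
  )

  # Suffixed column names as plain strings, e.g. "a.x" and "a.y"
  cols <- paste0(var, suffix)

  inner_join(df1, df2, by = by, suffix = suffix) %>%
    select(all_of(c(by, cols))) %>%                     # select by string
    filter(.data[[cols[[1]]]] != .data[[cols[[2]]]])    # compare the columns
}

# Made-up data, for illustration only
left  <- tibble(id = 1:3, a = c("d", "f", "g"))
right <- tibble(id = 2:4, a = c("k", "g", "z"))
res <- mydiff3(left, right, "a")
```

Here `.data[[name]]` looks the column up by its string name inside the data frame, so the comparison is made between columns rather than between the names.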

Answer 4 (score: 1)

Finding the indices of the differing rows first might be enough to solve this problem.

mydiff <- function(filteron, df_1 = df1, df_2 = df2){
  require(dplyr, warn.conflicts = F)
  col_1 <- paste0(quo_name(filteron), "x")
  col_2 <- paste0(quo_name(filteron), "y")
  my_df <- inner_join(df_1, df_2, by = "id", suffix = c("x", "y")) %>%
    select(id, col_1, col_2)
  # find indices of the rows where the two columns differ
  different <- my_df[[col_1]] != my_df[[col_2]]
  # return those rows
  my_df[different, ]
}

mydiff("a")
#> # A tibble: 1 x 3
#>      id ax    ay   
#>   <dbl> <chr> <chr>
#> 1    14 f     k
