Using purrr to iterate over two lists and then pipe into dplyr::filter

Asked: 2018-01-07 22:39:35

Tags: r dplyr tidyverse purrr rlang

library(tidyverse)
library(purrr)

Using the sample data below, I can create the following function:

Funs <- function(DF, One, Two){

    One <- enquo(One)
    Two <- enquo(Two)

    DF %>% filter(School == (!!One) & Code == (!!Two)) %>%
        group_by(Code, School) %>%
        summarise(Count = sum(Question1))
}

I can then use the function to filter on two variables, School and Code, like this:

Funs(DF, "School1", "B344")

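This all works fine, but my real data has many values, so I don't want to keep typing the "School" and "Code" values into the function by hand. Instead, I'd like to use the tidyverse and purrr packages to loop over two lists (one of schools, one of codes) and feed them into the filter. I want the output to be a list of results.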

To keep things simple, the two lists fed into dplyr::filter have only two values each: School2 is paired with S300, and School1 is paired with B344, as in the example above.

Some of the things I have tried:

map2(c(“School2”, ”School1”),
     c(“S300”, ”B344”),
     function(x,y) {
         DF %>% filter(School == .x & Code == .y) %>%
             group_by(Code, School) %>%
             summarise(Count = sum(Question1))
     }

Also...

map2(c("School2", "School1")),
     c("S300","B344"),
     ~filter(School == .x & Code == .y) %>%
         group_by(Code, School)%>%
         summarise(Count = sum(Question1))

And this...

list(c("School2", "School1"), c("S300", "B344")) %>%
    map2( ~ filter(School == .x & Code == .y) %>%
             group_by(Code, School) %>%
             summarise(Count = sum(Question1)))

None of these seem to work, so please help!

Sample data:

Code <- c("B344","B555","S300","T220","B888","B888","B555","B344","B344","T220","B555","B555","S300","B555","S300","S300","S300","S300","B344","B344","B888","B888","B888")
School <- c("School1","School1","School2","School3","School4","School4","School1","School1","School3","School3","School4","School1","School1","School3","School2","School2","School4","School2","School3","School4","School3","School1","School2")
Question1 <- c(3,4,5,4,5,5,5,4,5,3,4,5,4,5,4,3,3,3,4,5,4,3,3)
Question2 <- c(5,4,3,4,3,5,4,3,2,3,4,5,4,5,4,3,4,4,5,4,3,3,4)
DF <- data_frame(Code, School, Question1, Question2)

1 answer:

Answer 0 (score: 1)

Here are a few options, ordered from closest to your original code to the best approach:

library(tidyverse)

DF <- data_frame(Code = c("B344", "B555", "S300", "T220", "B888", "B888", "B555", "B344", "B344", "T220", "B555", "B555", "S300", "B555", "S300", "S300", "S300", "S300", "B344", "B344", "B888", "B888", "B888"), 
                 School = c("School1", "School1", "School2", "School3", "School4", "School4", "School1", "School1", "School3", "School3", "School4", "School1", "School1", "School3", "School2", "School2", "School4", "School2", "School3", "School4", "School3", "School1", "School2"), 
                 Question1 = c(3, 4, 5, 4, 5, 5, 5, 4, 5, 3, 4, 5, 4, 5, 4, 3, 3, 3, 4, 5, 4, 3, 3), 
                 Question2 = c(5, 4, 3, 4, 3, 5, 4, 3, 2, 3, 4, 5, 4, 5, 4, 3, 4, 4, 5, 4, 3, 3, 4))

wanted <- data_frame(School = c("School2", "School1"),
                     Code = c("S300", "B344"))

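For map2 to work with the tilde (formula) notation, the variables must be named .x and .y; with regular function notation you can call them whatever you like. Don't forget that the first argument of filter is the data frame, which is what normally gets piped in, so it has to be passed explicitly here: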

map2_dfr(wanted$School, wanted$Code, ~filter(DF, School == .x, Code == .y)) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
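
The question asked for a list of results rather than one combined data frame. A minimal sketch of that variant, using regular function notation with map2 (which returns a list) instead of map2_dfr. The `toy` data and the helper name `count_pair` are made up for illustration and are not part of the original post:

```r
library(dplyr)
library(purrr)

# Small stand-in data set (hypothetical values, not the question's DF)
toy <- tibble(
  School    = c("School1", "School1", "School2", "School2"),
  Code      = c("B344", "B555", "S300", "S300"),
  Question1 = c(3, 4, 5, 4)
)

# Hypothetical helper: keep one School/Code pair and total Question1.
# The argument names are lowercase so they cannot be confused with the
# School and Code columns inside filter().
count_pair <- function(school, code) {
  toy %>%
    filter(School == school, Code == code) %>%
    summarise(School = school, Code = code, Count = sum(Question1))
}

# map2() walks both vectors in parallel and returns a list,
# one single-row tibble per School/Code pair
results <- map2(c("School2", "School1"), c("S300", "B344"), count_pair)
```

Each element of `results` can then be inspected on its own, e.g. `results[[1]]` for the School2/S300 pair.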

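Since I set up wanted as a data frame (a plain list would work too), you can use pmap. With two variables, the parameter names .x and .y from map2 actually still work with pmap, but the formula is really a function with ... parameters, so it usually makes sense to treat them differently, e.g. with the ..1 notation: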

wanted %>% 
    pmap_dfr(~filter(DF, School == ..1, Code == ..2)) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
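
As an alternative to the positional ..1 and ..2, pmap can also match the elements of a named list (or the columns of a data frame) to function arguments by name. A small sketch under that assumption, with made-up `toy` and `pairs` objects; the columns of `pairs` are deliberately lowercase, because an argument called `School` would be shadowed by the `School` column inside filter():

```r
library(dplyr)
library(purrr)

# Small stand-in data set (hypothetical values, not the question's DF)
toy <- tibble(
  School    = c("School1", "School1", "School2", "School2"),
  Code      = c("B344", "B555", "S300", "S300"),
  Question1 = c(3, 4, 5, 4)
)

# Lookup table of wanted pairs; the column names must match the
# argument names of the function handed to pmap()
pairs <- tibble(
  school = c("School2", "School1"),
  code   = c("S300", "B344")
)

# pmap() calls the function once per row of `pairs`, passing the
# columns as the named arguments `school` and `code`
by_pair <- pmap(pairs, function(school, code) {
  filter(toy, School == school, Code == code)
})
```

The result is a list with one filtered tibble per row of `pairs`.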

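The problem with both techniques above is that they will be slow, because they run filter once for each row of wanted, which means every row of DF gets re-tested multiple times. To keep the code similar, a slightly kludgy way to avoid the extra work is to combine the columns into one, e.g. with tidyr::unite: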

DF %>% 
    unite(school_code, School, Code) %>% 
    filter(school_code %in% invoke(paste, wanted, sep = '_')) %>%    # or paste(wanted$School, wanted$Code, sep = '_') or equivalent
    separate(school_code, c('School', 'Code')) %>%
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0

...or just combine them inside filter:

DF %>% 
    filter(paste(School, Code) %in% paste(wanted$School, wanted$Code)) %>%    # or invoke(paste, wanted)
    group_by(School, Code) %>% 
    summarise_all(sum)
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
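
One caveat with matching on pasted strings: if a value can itself contain the separator, two different pairs may collapse to the same string. The values below are invented purely to show the collision; with the sample data in this question the default separator is fine:

```r
# Hypothetical values chosen to collide under paste()'s default separator " "
collide_a <- paste("North High", "A1")   # "North High A1"
collide_b <- paste("North", "High A1")   # "North High A1"
same_default <- identical(collide_a, collide_b)

# A separator that cannot occur in either column keeps the pairs distinct
safe_a <- paste("North High", "A1", sep = "|")
safe_b <- paste("North", "High A1", sep = "|")
same_safe <- identical(safe_a, safe_b)
```

Here `same_default` is TRUE and `same_safe` is FALSE, so picking a separator that never appears in the data avoids false matches.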

The best way to get the desired result may be more obvious now that wanted is a data frame: a join, which is designed to do exactly this job:

DF %>% 
    inner_join(wanted) %>% 
    group_by(School, Code) %>% 
    summarise_all(sum)
#> Joining, by = c("Code", "School")
#> # A tibble: 2 x 4
#> # Groups: School [?]
#>   School  Code  Question1 Question2
#>   <chr>   <chr>     <dbl>     <dbl>
#> 1 School1 B344       7.00      8.00
#> 2 School2 S300      15.0      14.0
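
A possible variation on the join: dplyr's semi_join is a filtering join that keeps the rows of the left table that have a match in the right table, without adding columns or duplicating rows. The sketch below uses made-up `toy` and `wanted_toy` objects, spells out `by` to avoid the "Joining, by" message, and assumes dplyr 1.0.0 or later for the `.groups` argument:

```r
library(dplyr)

# Small stand-in data set (hypothetical values, not the question's DF)
toy <- tibble(
  School    = c("School1", "School1", "School2", "School2"),
  Code      = c("B344", "B555", "S300", "S300"),
  Question1 = c(3, 4, 5, 4)
)

wanted_toy <- tibble(
  School = c("School2", "School1"),
  Code   = c("S300", "B344")
)

# Keep only the rows of `toy` whose School/Code pair appears in `wanted_toy`
kept <- semi_join(toy, wanted_toy, by = c("School", "Code"))

# Total Question1 per pair; .groups = "drop" returns an ungrouped tibble
totals <- kept %>%
  group_by(School, Code) %>%
  summarise(Count = sum(Question1), .groups = "drop")
```

Unlike inner_join, semi_join would not repeat rows if the same pair were listed twice in the lookup table, which matters when the sums are computed afterwards.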