dplyr: piping multiple datasets into summarize()

Asked: 2014-12-19 18:10:15

Tags: r dplyr

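I am using dplyr to create a table, and I would like to run the same "summarize" command on multiple datasets. I know that in ggplot2 you can swap the dataset and rerun the plot, which is cool.

This is what I want to avoid: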

table_1 <- 
group_by(df_1, boro) %>%
  summarize(n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob=="broker"),
            pr_broker = n_broker/n_units)

table_2 <- 
group_by(df_2, boro) %>%
  summarize(n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob=="broker"),
            pr_broker = n_broker/n_units)

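Basically, is there a way to set up the summarize command as a function or something similar, so that I can just feed df_1 and df_2 into it?

2 answers:

Answer 0: (score: 4)

If you know all the variable names in advance, and they are the same in every dataset you want to look at, you can do the following: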

myfunc <- function(df) {
  df %>%
    group_by(cyl) %>%
    summarize(n = n(),
              mean_hp = mean(hp))
}

myfunc(mtcars)
#Source: local data frame [3 x 3]
#
#  cyl  n   mean_hp
#1   4 11  82.63636
#2   6  7 122.28571
#3   8 14 209.21429

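You can then use it with different datasets (as long as they have the same structure and variable names). If you need more flexibility, i.e. you do not know all the variables in advance and want to pass them to the function as inputs, have a look at the dplyr non-standard evaluation vignette.

Here is just a small example of how you could implement "standard evaluation" in your function to allow for more flexibility. Suppose you want to let the user of the function specify which column the data should be grouped by. You could do this: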

myfunc <- function(df, grp) {
  df %>%
    # notice that I use "group_by_" instead of "group_by": it takes the
    # column name as a string (deprecated in later dplyr releases)
    group_by_(grp) %>%
    summarize(n = n(),
              mean_hp = mean(hp))
}

And then use it:

myfunc(mtcars, "gear")
#Source: local data frame [3 x 3]
#
#  gear  n  mean_hp
#1    3 15 176.1333
#2    4 12  89.5000
#3    5  5 195.6000

myfunc(mtcars, "cyl")
#Source: local data frame [3 x 3]
#
#  cyl  n   mean_hp
#1   4 11  82.63636
#2   6  7 122.28571
#3   8 14 209.21429
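
The underscored verbs such as group_by_ were deprecated in later dplyr releases. As a rough sketch of the same idea for a newer version (assuming dplyr 1.0.0 or later; the function name summarize_by is just an illustrative choice, not part of the original answer), the grouping column can still be passed as a string by selecting it with across(all_of()):

```r
library(dplyr)

# Sketch only: same behaviour as myfunc(df, grp) above, but without group_by_().
# `summarize_by` is a hypothetical name; `grp` is a column name given as a string.
summarize_by <- function(df, grp) {
  df %>%
    group_by(across(all_of(grp))) %>%   # all_of() errors if the column is missing
    summarize(n = n(),
              mean_hp = mean(hp),
              .groups = "drop")         # return an ungrouped result
}

by_gear <- summarize_by(mtcars, "gear")
by_gear
```

This should give the same three rows per gear as the myfunc(mtcars, "gear") call above, only printed as a tibble.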

Answer 1: (score: 3)

The %>% operator simply passes a tbl object as the first argument to the next function, and summarize only needs a tbl. So you can define

mysummary <- function(.data) {
  summarize(.data, n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob=="broker"),
            pr_broker = n_broker/n_units)
}

and then call

table_1 <- group_by(df_1, boro) %>% mysummary
table_2 <- group_by(df_2, boro) %>% mysummary

With an actual working example:

mysummary <- function(.data) {
  summarize(.data, 
      ave.mpg=mean(mpg),
      ave.hp=mean(hp)
  )
}

mtcars %>% group_by(cyl) %>% mysummary
mtcars %>% group_by(gear) %>% mysummary
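
If there are more than a couple of datasets, one possible extension is to keep them in a named list and apply the same pipeline to each element. The sketch below is only an illustration: the two mtcars subsets are made-up stand-ins for df_1 and df_2, and the list names automatic and manual are assumptions, not anything from the question.

```r
library(dplyr)

mysummary <- function(.data) {
  summarize(.data,
            ave.mpg = mean(mpg),
            ave.hp  = mean(hp))
}

# Hypothetical stand-ins for df_1 and df_2: two subsets of mtcars
datasets <- list(
  automatic = filter(mtcars, am == 0),
  manual    = filter(mtcars, am == 1)
)

# Run the same group_by + summary on every dataset; the result is a named list
tables <- lapply(datasets, function(df) df %>% group_by(cyl) %>% mysummary())

# Optionally stack the results, keeping the list name in a "dataset" column
combined <- bind_rows(tables, .id = "dataset")
combined
```

Here tables$automatic and tables$manual play the role of table_1 and table_2, and combined holds both summaries in a single table.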