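I am building a table with dplyr, and I want to run the same summarize command on several datasets. I know that in ggplot2 you can swap the dataset and rerun the plot, which is cool.
This is what I want to avoid: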
table_1 <-
  group_by(df_1, boro) %>%
  summarize(n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob == "broker"),
            pr_broker = n_broker / n_units)
table_2 <-
  group_by(df_2, boro) %>%
  summarize(n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob == "broker"),
            pr_broker = n_broker / n_units)
Basically, is there a way to turn the summarize command into a function or something similar, so that I can just plug in df_1 and df_2?
Answer 0 (score: 4)
If you know all the variable names in advance, and they are the same in every dataset you want to look at, you can do the following:
library(dplyr)

# Group by cyl, then count the rows and average hp within each group
myfunc <- function(df) {
  df %>%
    group_by(cyl) %>%
    summarize(n = n(),
              mean_hp = mean(hp))
}
myfunc(mtcars)
#Source: local data frame [3 x 3]
#
# cyl n mean_hp
#1 4 11 82.63636
#2 6 7 122.28571
#3 8 14 209.21429
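You can then use it with different datasets, as long as they have the same structure and variable names. If you need more flexibility, i.e. you don't know all the variables in advance and want to pass them in as function arguments, have a look at the dplyr non-standard evaluation vignette.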
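To run such a function over several datasets in one go, one option is to keep them in a named list and call lapply. This is a minimal sketch, assuming dplyr is installed; summarize_by_cyl is a hypothetical name for the function above, and the two subsets of mtcars are only illustrative stand-ins for df_1 and df_2:

```r
library(dplyr)

# Same logic as the function above, under a hypothetical name
summarize_by_cyl <- function(df) {
  df %>%
    group_by(cyl) %>%
    summarize(n = n(),
              mean_hp = mean(hp))
}

# Illustrative stand-ins for df_1 and df_2: two subsets of mtcars
datasets <- list(
  manual    = mtcars[mtcars$am == 1, ],
  automatic = mtcars[mtcars$am == 0, ]
)

# One summary table per dataset, returned as a named list
tables <- lapply(datasets, summarize_by_cyl)
tables$manual
```

The list names carry over to the result, so each table can be retrieved by the name of its dataset.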
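Here is a very small example of how you could implement "standard evaluation" in your function to allow more flexibility. Suppose you want to let the user of the function specify which column the data should be grouped by; you could do this: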
myfunc <- function(df, grp) {
  df %>%
    # notice "group_by_" instead of "group_by": it takes the column name
    # as a string (group_by_ has been deprecated since dplyr 0.7.0)
    group_by_(grp) %>%
    summarize(n = n(),
              mean_hp = mean(hp))
}
And then use it:
myfunc(mtcars, "gear")
#Source: local data frame [3 x 3]
#
# gear n mean_hp
#1 3 15 176.1333
#2 4 12 89.5000
#3 5 5 195.6000
myfunc(mtcars, "cyl")
#Source: local data frame [3 x 3]
#
# cyl n mean_hp
#1 4 11 82.63636
#2 6 7 122.28571
#3 8 14 209.21429
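Since group_by_() is deprecated in newer dplyr releases, the same idea can be expressed with tidy selection instead. The following is only a sketch of one alternative, not the approach of the original answer; it assumes dplyr 1.0.0 or later (for across() and the .groups argument), and summarize_hp_by is a hypothetical name:

```r
library(dplyr)

# Hypothetical helper: grp is a character vector of column names to group by
summarize_hp_by <- function(df, grp) {
  df %>%
    group_by(across(all_of(grp))) %>%  # select grouping columns by name
    summarize(n = n(),
              mean_hp = mean(hp),
              .groups = "drop")        # return an ungrouped result
}

by_gear <- summarize_hp_by(mtcars, "gear")
by_gear
```

Because all_of() accepts a character vector, the same function also works with more than one grouping column, e.g. summarize_hp_by(mtcars, c("gear", "cyl")).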
Answer 1 (score: 3)
The %>% operator simply passes a tbl object as the first argument of the next function, and summarize just needs a tbl. So you can define
mysummary <- function(.data) {
  summarize(.data,
            n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob == "broker"),
            pr_broker = n_broker / n_units)
}
Then call
table_1 <- group_by(df_1, boro) %>% mysummary
table_2 <- group_by(df_2, boro) %>% mysummary
With an actual working example:
mysummary <- function(.data) {
  summarize(.data,
            ave.mpg = mean(mpg),
            ave.hp = mean(hp)
  )
}
mtcars %>% group_by(cyl) %>% mysummary
mtcars %>% group_by(gear) %>% mysummary
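If the grouping step is also the same for every dataset, the whole chain can be stored as one function. The magrittr pipe that dplyr re-exports builds a function when the chain starts with the dot placeholder. A small sketch under that assumption, with summary_by_cyl as a hypothetical name:

```r
library(dplyr)

# Starting the chain with "." yields a function of one argument
# instead of evaluating the pipeline immediately
summary_by_cyl <- . %>%
  group_by(cyl) %>%
  summarize(ave.mpg = mean(mpg),
            ave.hp = mean(hp))

# Call it like any other function, on any data frame with these columns
result <- summary_by_cyl(mtcars)
result
```

This keeps the grouping and the summary in one place, at the cost of hard-coding the grouping column.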