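I am trying to use a custom function (which returns a scalar) in summarise or mutate within a dplyr pipeline that includes group_by. The function works when I call it directly, but it fails as soon as it follows group_by.
See my code below for what I tried. I did manage to get it working, but the approach feels hacky, and I would like to understand why it does not work the way I expected: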
## Load required libraries
library(dplyr)
library(tidyr)
library(ROCR)
set.seed(0)
## Generate fake data
df1 <- data.frame(predictions = seq(0,1,.01), date = seq(as.Date("2009-01-01"), by = "month", length.out = 101), labels = sample(c(0,1), replace=TRUE, size=101), grouping = rep('a',101))
df2 <- data.frame(predictions = seq(0,1,.01), date = seq(as.Date("2010-01-01"), by = "month", length.out = 101), labels = sample(c(0,1), replace=TRUE, size=101), grouping = rep('b',101))
df <- rbind(df1,df2)
## Gini coefficient calculation function
dplyr_Gini <- function(df, predictions, labels, label.ordering = NULL, ...) {
  predictions <- enquo(predictions)
  labels <- enquo(labels)
  predictions <- df %>% select(!!predictions) %>% .[[1]]
  labels <- df %>% select(!!labels) %>% .[[1]]
  if (length(unique(labels)) != 2) {
    return(NA)
  }
  pred <- prediction(predictions, labels, label.ordering)
  auc.perf <- performance(pred, measure = "auc")
  gini <- 2 * auc.perf@y.values[[1]] - 1
  return(gini)
}
## test dplyr_Gini - works as expected
dplyr_Gini(df1,predictions, labels)
> [1] -0.05494505
dplyr_Gini(df2,predictions, labels)
> [1] 0.09456265
## Wrapper function for using dplyr_Gini in group_by
calc_Gini <- function(df, group, predictions, labels) {
  predictions <- enquo(predictions)
  labels <- enquo(labels)
  df %>%
    filter(grouping %in% group) %>%
    group_by(grouping) %>%
    summarise(min.date = min(date),
              max.date = max(date),
              Gini = dplyr_Gini(., !!predictions, !!labels)) %>%
    ungroup()
}
calc_Gini(df,group = c('a','b'),predictions, labels)
> # Adding missing grouping variables: `grouping`
> # Adding missing grouping variables: `grouping`
> # Error in prediction(predictions, labels, label.ordering) :
> # Format of predictions is invalid.
## Wrapper function that works for using dplyr_Gini in group_by - but is kind of hacky.
calc_Gini_working <- function(df, group, predictions, labels) {
  predictions <- enquo(predictions)
  labels <- enquo(labels)
  df %>%
    filter(grouping %in% group) %>%
    group_by(grouping) %>%
    mutate(min.date = min(date),
           max.date = max(date)) %>%
    group_by(grouping, min.date, max.date) %>%
    do(Gini = dplyr_Gini(., !!predictions, !!labels)) %>%
    unnest() %>%
    ungroup()
}
calc_Gini_working(df,group = c('a','b'),predictions, labels)
> # A tibble: 2 x 4
>   grouping min.date   max.date      Gini
>   <fct>    <date>     <date>       <dbl>
> 1 a        2009-01-01 2017-05-01 -0.0549
> 2 b        2010-01-01 2018-05-01  0.0946
I expected calc_Gini to work, since all I did was add my custom function (dplyr_Gini) to a summarise call after group_by.
As you can see, it works fine if I wrap dplyr_Gini in do and then unnest, but I am not sure why.
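A minimal sketch of what appears to be going on, using a hypothetical toy data frame (`toy`, `g`, and `x` are made up for illustration and are not part of the code above). Inside `summarise()`, `.` is the magrittr placeholder and stands for the entire grouped data frame that was piped in, whereas inside `do()`, `.` is bound to the rows of the current group. In addition, `select()` on a grouped data frame keeps the grouping column (hence the "Adding missing grouping variables" message), so `.[[1]]` inside `dplyr_Gini` would return `grouping` rather than `predictions`, which would explain the "Format of predictions is invalid" error.

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows, group "b" has one
toy <- data.frame(g = c("a", "a", "b"), x = 1:3)

# In summarise(), `.` is the whole piped-in data frame, so nrow(.) is the
# same for every group, while n() counts the rows of the current group
in_summarise <- toy %>%
  group_by(g) %>%
  summarise(rows_in_dot = nrow(.), rows_in_group = n())

# In do(), `.` is the subset of rows belonging to the current group
in_do <- toy %>%
  group_by(g) %>%
  do(data.frame(rows_in_dot = nrow(.)))

# select() on a grouped data frame keeps the grouping column, and puts it
# first, so .[[1]] picks up `g` instead of the requested column `x`
first_col <- toy %>%
  group_by(g) %>%
  select(x) %>%
  .[[1]]
```

Comparing `in_summarise$rows_in_dot` with `in_do$rows_in_dot` shows the difference between the two meanings of `.`.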
Answer 0 (score: 0)
Based on how dplyr_Gini is constructed, one option is to use group_split and then map:
library(tidyverse)
calc_Gini <- function(df, group, predictions, labels) {
  predictions <- enquo(predictions)
  labels <- enquo(labels)
  df %>%
    filter(grouping %in% group) %>%
    group_split(grouping, remove = FALSE) %>%
    map_dfr(., ~
      tibble(grouping = first(.x$grouping),
             min.date = min(.x$date),
             max.date = max(.x$date),
             Gini = dplyr_Gini(.x, !!predictions, !!labels)))
}
calc_Gini(df,group = c('a','b'),predictions, labels)
# A tibble: 2 x 4
#   grouping min.date   max.date      Gini
#   <fct>    <date>     <date>       <dbl>
# 1 a        2009-01-01 2017-05-01 -0.0549
# 2 b        2010-01-01 2018-05-01  0.0946
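An alternative sketch that avoids splitting the data: if the Gini helper takes plain vectors instead of a data frame, it can be called directly inside `summarise()`, where column names already refer to the current group's values. The names `gini_vec` and `calc_Gini_vec` are hypothetical and not part of the original code; the ROCR calls are the same ones `dplyr_Gini` uses, and the fake data is generated the same way as in the question.

```r
library(dplyr)
library(ROCR)

set.seed(0)
# Fake data built the same way as in the question
df1 <- data.frame(predictions = seq(0, 1, .01),
                  date = seq(as.Date("2009-01-01"), by = "month", length.out = 101),
                  labels = sample(c(0, 1), replace = TRUE, size = 101),
                  grouping = rep("a", 101))
df2 <- data.frame(predictions = seq(0, 1, .01),
                  date = seq(as.Date("2010-01-01"), by = "month", length.out = 101),
                  labels = sample(c(0, 1), replace = TRUE, size = 101),
                  grouping = rep("b", 101))
df <- rbind(df1, df2)

# Hypothetical helper: Gini coefficient computed from two plain vectors
gini_vec <- function(predictions, labels, label.ordering = NULL) {
  # The AUC needs exactly two classes; otherwise return a missing value
  if (length(unique(labels)) != 2) {
    return(NA_real_)
  }
  pred <- prediction(predictions, labels, label.ordering)
  auc <- performance(pred, measure = "auc")@y.values[[1]]
  2 * auc - 1
}

# Hypothetical wrapper: the unquoted columns are evaluated per group by
# summarise(), so no `.` placeholder is involved
calc_Gini_vec <- function(df, group, predictions, labels) {
  predictions <- enquo(predictions)
  labels <- enquo(labels)
  df %>%
    filter(grouping %in% group) %>%
    group_by(grouping) %>%
    summarise(min.date = min(date),
              max.date = max(date),
              Gini = gini_vec(!!predictions, !!labels)) %>%
    ungroup()
}

res <- calc_Gini_vec(df, group = c("a", "b"), predictions, labels)
```

Here `res` should contain one row per group with the columns `grouping`, `min.date`, `max.date`, and `Gini`. Each `Gini` value should match what `gini_vec` returns for that group's rows on their own.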