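I am trying to create new columns that are grouped by different columns, but I am not sure whether the way I am doing it is the best way to use group_by. Is there a way to do the group_by in one line?
I know this can be done with the data.table package, whose syntax has the form DT[i, j, by].
However, this is only a small piece of a larger codebase that uses the tidyverse and works as is, so I would rather not move away from it.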
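## Creating Sample Data Frame
library(tidyverse)  # provides %>%, str_c() and bind_cols()
state <- rep(c("OH", "IL", "IN", "PA", "KY"), 10)
county <- sample(LETTERS[1:5], 50, replace = TRUE) %>% str_c(state, sep = "-")
# sample() draws from the given values; sample.int() expects a single number n, not a range
customers <- sample(50:100, 50)
sales <- sample(500:5000, 50)
df <- data.frame(state, county, customers, sales)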
## workflow
df2 <- df %>%
  group_by(state) %>%
  mutate(customerInState = sum(customers),
         saleInState = sum(sales)) %>%
  ungroup() %>%
  group_by(county) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>%
  ungroup() %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  group_by(state) %>%
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup()
I would like my code to look like this:
df3 <- df %>%
  mutate(customerInState = sum(customers, by = state),
         saleInState = sum(sales, by = state),
         customerInCounty = sum(customers, by = county),
         saleInCounty = sum(sales, by = county),
         salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState,
         minSale = min(salePerCountyPercent, by = state))
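It runs without errors, but I know the output is not correct.
I understand that these mutates could be rearranged so that fewer group_bys are needed to get the required result. The question, though, is whether there is a way to do the grouping inline in dplyr.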
Answer 0 (score: 4)
You can create a wrapper that does what you want. This particular solution works if you have a single grouping variable. Good luck!
library(tidyverse)
mutate_by <- function(.data, group, ...) {
  group_by(.data, !!enquo(group)) %>%
    mutate(...) %>%
    ungroup()
}
df1 <- df %>%
  mutate_by(state,
            customerInState = sum(customers),
            saleInState = sum(sales)) %>%
  mutate_by(county,
            customerInCounty = sum(customers),
            saleInCounty = sum(sales)) %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate_by(state,
            minSale = min(salePerCountyPercent))
identical(df2, df1)
[1] TRUE
Edit: or, more concisely and closer to your code:
df %>%
  mutate_by(customerInState = sum(customers),
            saleInState = sum(sales), group = state) %>%
  mutate_by(customerInCounty = sum(customers),
            saleInCounty = sum(sales), group = county) %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate_by(minSale = min(salePerCountyPercent), group = state)
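If more than one grouping variable is needed, the wrapper idea can be stretched a little. The following is only a sketch: mutate_by_multi is a hypothetical name, the toy data is made up, and it assumes dplyr 1.0.0 or later so that across() can be used inside group_by().

```r
library(dplyr)

# Hypothetical variant of mutate_by(): `groups` may be a single column
# or several columns wrapped in c(), e.g. c(state, county).
mutate_by_multi <- function(.data, groups, ...) {
  .data %>%
    group_by(across({{ groups }})) %>%
    mutate(...) %>%
    ungroup()
}

# Toy data, made up for illustration only.
toy <- tibble(
  state  = c("OH", "OH", "OH", "IL"),
  county = c("A-OH", "A-OH", "B-OH", "A-IL"),
  sales  = c(10, 20, 30, 40)
)

res <- toy %>%
  mutate_by_multi(c(state, county), saleInCounty = sum(sales)) %>%
  mutate_by_multi(state, saleInState = sum(sales))
```

The embrace operator {{ }} plays the same role as !!enquo() in the wrapper above; across() is what lets a c(...) selection through.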
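Answer 1 (score: 3)
Ah, you mean the syntax style. No, I'm afraid that is not how the tidyverse works. If you want the tidyverse, you had better use pipes. But: (i) once you have grouped by something, the data stays grouped until you group again by another column. (ii) If you group again, there is no need to ungroup first. So we can shorten your code: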
df3 <- df %>%
  group_by(county) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>%
  group_by(state) %>%
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup()
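That leaves two group_bys. The last two mutate calls share the same grouping by state, so they could also be merged into one, giving two mutates.
Note that the column order differs, but we can easily test that the data are the same:
identical(df3 %>% select(colnames(df2)), df2) # TRUE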
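(iii) I know nothing about the administrative structure of the US, but I assume counties are nested within states, right? In that case, how about using summarise? Do you need to keep all the individual sales, or is it enough to produce statistics per county and/or per state?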
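Points (i) and (ii) can be checked directly with group_vars(), which reports the current grouping columns. A minimal sketch on made-up toy data (the object names here are only for illustration):

```r
library(dplyr)

# Toy data, made up for illustration only.
toy <- tibble(
  state  = c("OH", "OH", "IL"),
  county = c("A-OH", "B-OH", "A-IL"),
  sales  = c(10, 20, 30)
)

# (i) The grouping survives a mutate().
after_mutate <- toy %>%
  group_by(county) %>%
  mutate(saleInCounty = sum(sales))
group_vars(after_mutate)  # "county"

# (ii) A second group_by() replaces the previous grouping,
# so no ungroup() is needed in between.
regrouped <- after_mutate %>% group_by(state)
group_vars(regrouped)     # "state"
```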
Answer 2 (score: 3)
You can do it in two steps: create two datasets and then left_join them.
library(dplyr)
# State-level totals
df2 <- df %>%
  group_by(state) %>%
  summarise(customerInState = sum(customers),
            saleInState = sum(sales))
# County-level totals within each state
df3 <- df %>%
  group_by(state, county) %>%
  summarise(customerInCounty = sum(customers),
            saleInCounty = sum(sales))
# Join on the shared key so that every county row carries its state totals
df2 <- left_join(df2, df3, by = "state") %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  group_by(state) %>%
  mutate(minSale = min(salePerCountyPercent))
Final clean-up.
rm(df3)
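As a side note on the original question: later dplyr releases (1.1.0 onward, which this sketch assumes) accept a .by argument in mutate() that groups for that single call only and returns ungrouped data. That comes fairly close to the inline syntax asked for. The toy data below is made up for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the .by argument

# Toy data, made up for illustration only.
toy <- tibble(
  state     = c("OH", "OH", "OH", "IL", "IL"),
  county    = c("A-OH", "A-OH", "B-OH", "A-IL", "B-IL"),
  customers = c(1, 2, 3, 4, 5),
  sales     = c(10, 20, 30, 40, 60)
)

res <- toy %>%
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         .by = state) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales),
         .by = county) %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate(minSale = min(salePerCountyPercent), .by = state)
```

Each .by applies to its own mutate() call only, so there is nothing to ungroup afterwards. One mutate() call still takes a single grouping, so state-level and county-level columns need separate calls.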