Inline group_by in dplyr when mutating columns

Time: 2019-07-16 14:41:10

Tags: r dplyr

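I am trying to create new columns that are grouped by different columns, but I am not sure whether the way I am doing it is the best way to use group_by. Is there a way to do the group_by inline, in a single line?

I know this can be done with the data.table package, using its DT[i, j, by] syntax.

However, this is only a small part of a larger codebase that uses the tidyverse and works as is, so I would rather not deviate from that.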
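For reference, the data.table idiom mentioned above looks roughly like the sketch below. The table `DT` and its values are made up for illustration and are not the question's data.

```r
library(data.table)

# Toy table, invented for illustration only.
DT <- data.table(
  state     = c("OH", "OH", "IL"),
  customers = c(10, 20, 5)
)

# In DT[i, j, by]: `:=` in j adds a column by reference,
# and `by` defines the groups the expression is evaluated in.
DT[, customerInState := sum(customers), by = state]

print(DT$customerInState)  # 30 30 5
```

The grouping is stated in the same call as the computation, which is the style the question is asking for in dplyr.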

## Creating Sample Data Frame
library(tidyverse)

state <- rep(c("OH", "IL", "IN", "PA", "KY"), 10)
county <- sample(LETTERS[1:5], 50, replace = TRUE) %>% str_c(state, sep = "-")
## sample.int() expects a single number n, so draw from the ranges with sample()
customers <- sample(50:100, 50)
sales <- sample(500:5000, 50)

df <- data.frame(state, county, customers, sales)

## workflow

df2 <- df %>%
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales)) %>% 
  ungroup %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>% 
  ungroup %>% 
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

I would like my code to look like this:

df3 <- df %>%
  mutate(customerInState = sum(customers, by = state),
         saleInState = sum(sales, by = state),
         customerInCounty = sum(customers, by = county),
         saleInCounty = sum(sales, by = county),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState,
         minSale = min(salePerCountyPercent, by = state))

It runs without error, but I know the output is not correct.
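A likely reason the attempt above does not group anything: sum() and min() have no `by` parameter. The signature of sum() is `sum(..., na.rm = FALSE)`, so a named `by = ...` argument is absorbed by `...` and treated as more values to aggregate. A minimal base R sketch with made-up numeric vectors (not the data frame above):

```r
# Toy vectors, invented for illustration only.
values <- c(1, 2, 3)
extra  <- c(10, 20)

# `by` is not a parameter of sum(); it lands in `...`
# and is added to the total instead of defining groups.
total_with_by <- sum(values, by = extra)
total_plain   <- sum(values, extra)

print(total_with_by)  # 36
print(total_plain)    # 36
```

Both calls return the same single number, which mutate() then recycles to every row.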

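I understand that the mutates could be rearranged so that fewer group_by calls are needed to get the same result. But the question is whether there is a way to do the grouping inline in dplyr.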

3 Answers:

Answer 0: (score: 4)

You can create a wrapper that does what you want. This particular solution works if you have a single grouping variable. Good luck!

library(tidyverse)

mutate_by <- function(.data, group, ...) {

  group_by(.data, !!enquo(group)) %>%
    mutate(...) %>%
    ungroup

}

df1 <- df %>%
  mutate_by(state, 
            customerInState = sum(customers),
            saleInState = sum(sales)) %>%
  mutate_by(county,
            customerInCounty = sum(customers),
            saleInCounty = sum(sales)) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(state,
            minSale = min(salePerCountyPercent))

identical(df2, df1)
[1] TRUE

Edit: or, more concisely and closer to your code:

df %>%
  mutate_by(customerInState = sum(customers),
            saleInState = sum(sales), group = state) %>%
  mutate_by(customerInCounty = sum(customers),
            saleInCounty = sum(sales), group = county) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(minSale = min(salePerCountyPercent), group = state)
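In the same spirit as the wrapper, newer dplyr releases (1.1.0 and later) provide a `.by` argument on mutate() that groups for a single call and returns an ungrouped result. The sketch below assumes such a version is installed and uses a small made-up data set (`toy`) rather than the random sample from the question.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, which added the .by argument

# Toy data, invented for illustration only.
toy <- tibble(
  state     = c("OH", "OH", "IL", "IL"),
  county    = c("A-OH", "B-OH", "A-IL", "A-IL"),
  customers = c(10, 20, 30, 40),
  sales     = c(100, 300, 200, 600)
)

res <- toy %>%
  mutate(customerInState = sum(customers),
         saleInState = sum(sales), .by = state) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales), .by = county) %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState) %>%
  # grouping applies to this call only; the result is not grouped
  mutate(minSale = min(salePerCountyPercent), .by = state)

print(res$minSale)  # 0.25 0.25 1.00 1.00
```

Because `.by` never leaves the data grouped, no ungroup step is needed between calls.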

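Answer 1: (score: 3)

Ah, you mean the syntax style. No, I'm afraid that is not how the tidyverse works. If you want the tidyverse, you had better use pipes. However: (i) once you have grouped by something, the data stays grouped until you group by a different column; (ii) if you group again, you do not need to ungroup first. So we can shorten your code: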

df3 <- df %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers), 
         saleInCounty = sum(sales)) %>% 
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState,
         minSale = min(salePerCountyPercent)) %>%
  ungroup

Two mutates and two group_bys.

Now: the column order is different, but we can easily test that the data are the same:

identical((df3 %>% select(colnames(df2))), (df2)) # TRUE
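Points (i) and (ii) can be checked directly with group_vars(). This is a small sketch on made-up data (`toy`), not the question's data frame.

```r
library(dplyr)

# Toy data, invented for illustration only.
toy <- tibble(
  state  = c("OH", "OH", "IL"),
  county = c("A-OH", "B-OH", "A-IL"),
  sales  = c(1, 2, 3)
)

by_county <- group_by(toy, county)
# A second group_by() replaces the existing grouping by default,
# so no ungroup() is needed in between.
regrouped <- group_by(by_county, state)
# mutate() keeps whatever grouping is in place.
after_mutate <- mutate(regrouped, saleInState = sum(sales))

print(group_vars(by_county))     # "county"
print(group_vars(regrouped))     # "state"
print(group_vars(after_mutate))  # "state"
```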

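(iii) I know nothing about the administrative structure of the US, but I assume counties are nested within states, right? In that case, how about using summarise? Do you need to keep every individual sale, or is it enough to produce statistics per county and/or per state?

Answer 2: (score: 3)

You can do it in two steps: create two data sets, then left_join them.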

library(dplyr)

df2 <- df %>%
  group_by(state) %>% 
  summarise(customerInState = sum(customers),
            saleInState = sum(sales))

df3 <- df %>%
  group_by(state, county) %>%
  summarise(customerInCounty = sum(customers),
            saleInCounty = sum(sales))

df2 <- left_join(df2, df3, by = "state") %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent))

Final cleanup.

rm(df3)