Question

I have a data.frame, for example:

df1 <- data.frame(id = c("A", "A", "B", "B", "B"),
                  cost = c(100, 10, 120, 102, 102))

I know I can use

df1.a <- group_by(df1, id) %>%
    summarise(no.c = n(), 
              m.costs = mean(cost))

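to compute the number of observations and the mean for each id. How can I instead compute the number of observations and the mean over all rows whose id is not equal to the group's id? For A it should give 3 observations (the rows that are not A), and for B it should give 2 (the rows that are not B).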

I would like to use the dplyr package and the group_by function, because I have to apply this to a large number of data frames.

Answer 1

You can use . to refer to the whole data.frame, which lets you compute the difference between each group and the total:

df1 %>% group_by(id) %>% 
    summarise(n = n(), 
              n_other = nrow(.) - n, 
              mean_cost = mean(cost), 
              mean_other = (sum(.$cost) - sum(cost)) / n_other)

## # A tibble: 2 × 5
##       id     n n_other mean_cost mean_other
##   <fctr> <int>   <int>     <dbl>      <dbl>
## 1      A     2       3        55        108
## 2      B     3       2       108         55

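As the results show, with only two groups you could simply use rev on the per-group values, but this approach extends easily to more groups or other calculations.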
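To illustrate the point about more groups, here is a minimal sketch with three ids. The data frame df3 is made up for this example and is not part of the original question. It refers to df3 by name instead of using . so that it does not depend on how the pipe binds the dot.

```r
library(dplyr)

# Hypothetical three-group data (an assumption, not from the question)
df3 <- data.frame(id = c("A", "A", "B", "B", "B", "C"),
                  cost = c(100, 10, 120, 102, 102, 50))

res <- df3 %>%
    group_by(id) %>%
    summarise(n = n(),                       # rows in this group
              n_other = nrow(df3) - n,       # rows in all other groups
              mean_other = (sum(df3$cost) - sum(cost)) / n_other)

print(res)
```

With three groups, reversing the per-group means no longer gives the right answer, while subtracting each group from the total still does.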

Answer 2

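Are you looking for something like this? It first computes the total cost and the total number of rows, then subtracts each group's total cost and row count from those totals, and uses the differences to compute the mean cost: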

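# Totals over the whole data frame
sumCost = sum(df1$cost)
totRows = nrow(df1)

# For each id: count and mean cost of the rows that belong to other ids
df1 %>% 
        group_by(id) %>% 
        summarise(no.c = totRows - n(), 
                  m.costs = (sumCost - sum(cost))/no.c)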

# A tibble: 2 x 3
#      id  no.c m.costs
#  <fctr> <int>   <dbl>
#1      A     3     108
#2      B     2      55
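Since the question mentions many data frames, the steps above could be wrapped in a small function and applied to a list. This is only a sketch: the name summarise_others is hypothetical, the second data frame df2 is made up, and the function assumes every data frame has columns named id and cost.

```r
library(dplyr)

# Hypothetical helper; assumes columns named `id` and `cost`
summarise_others <- function(df) {
    sum_cost <- sum(df$cost)   # total cost over all rows
    tot_rows <- nrow(df)       # total number of rows
    df %>%
        group_by(id) %>%
        summarise(no.c = tot_rows - n(),
                  m.costs = (sum_cost - sum(cost)) / no.c)
}

df1 <- data.frame(id = c("A", "A", "B", "B", "B"),
                  cost = c(100, 10, 120, 102, 102))
# Made-up second data frame, for illustration only
df2 <- data.frame(id = c("X", "Y", "Y"), cost = c(1, 2, 3))

# Apply the helper to each data frame in a named list
results <- lapply(list(df1 = df1, df2 = df2), summarise_others)
print(results$df1)
```

Note that if a data frame contains only one id, no.c is 0 for that group and m.costs becomes NaN.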

Using group_by and summarise from dplyr over all rows not in the current group_by group
