Question

I have been using the dplyr package to create aggregated data tables, for example with the following code:

agg_data <- df %>%
 select(calc.method, price1, price2) %>%
 group_by(calc.method) %>%
 summarize(
  count = n(),
  mean_price1 = round(mean(price1, na.rm = TRUE),2),
  mean_price2 = round(mean(price2, na.rm = TRUE),2))

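However, I would like to compute the means over only the distinct values of price1 and price2 within each group.

For example:

Price1: 1 1 2 1 2 2 1

becomes (before aggregation):

Price1: 1 2 1 2 1

(In general, price1 and price2 will not have the same number of values left after this removal.) I would also like a count for each of price1 and price2 that counts only the distinct values within a group. (A group is defined as two or more identical values adjacent to each other.)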

I tried:

agg_data <- df %>%
 select(calc.method, price1, price2) %>%
 group_by(calc.method) %>%
 summarize(
  count = n(),
  mean_price1 = round(mean(distinct(price1), na.rm = TRUE),2),
  mean_price2 = round(mean(distinct(price2), na.rm = TRUE),2))

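and I also tried wrapping the columns in distinct() inside the select function, but both attempts throw errors.

Is there a way to do this with dplyr or a similar package, without having to write everything from scratch?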

Answer 1

To meet your definition of distinct, we need to remove consecutive identical values. For a numeric vector, this can be done with:

x <- x[c(1, which(diff(x) != 0)+1)]

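By default, diff computes the differences between adjacent elements of a vector. We use it to detect consecutive values that differ, that is, diff(x) != 0. Because the output of diff is offset by one position, we add 1 to the indices of these differing elements, and we also always keep the first element. For example: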

x <- c(1,1,2,1,2,2,1)
x <- x[c(1, which(diff(x) != 0)+1)]
##[1] 1 2 1 2 1
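
To make the indexing step easier to follow, here is a small sketch that breaks the one-liner above into its intermediate values. The variable names d, changed, and keep are only illustrative and are not part of the original answer.

```r
x <- c(1, 1, 2, 1, 2, 2, 1)

# Differences between neighbouring elements (one element shorter than x).
d <- diff(x)
# d is:  0  1 -1  1  0 -1

# Positions in x whose value differs from the value just before it.
changed <- which(d != 0) + 1
# changed is: 3 4 5 7

# Always keep the first element, then every position where the value changed.
keep <- c(1, changed)
# keep is: 1 3 4 5 7

x[keep]
##[1] 1 2 1 2 1
```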

We can then use this with dplyr:

agg_data <- df %>% group_by(calc.method) %>%
                   summarize(count = n(),
                             count_non_rep_1 = length(price1[c(1,which(diff(price1) != 0)+1)]),
                             mean_price1 = round(mean(price1[c(1,which(diff(price1) != 0)+1)], na.rm=TRUE),2),
                             count_non_rep_2 = length(price2[c(1,which(diff(price2) != 0)+1)]),
                             mean_price2 = round(mean(price2[c(1,which(diff(price2) != 0)+1)], na.rm=TRUE),2))

Or, better yet, define a function:

remove.repeats <- function(x) {
  x[c(1,which(diff(x) != 0)+1)]
}

and use it with dplyr:

agg_data <- df %>% group_by(calc.method) %>%
                   summarize(count = n(),
                             count_non_rep_1 = length(remove.repeats(price1)),
                             mean_price1 = round(mean(remove.repeats(price1), na.rm=TRUE),2),
                             count_non_rep_2 = length(remove.repeats(price2)),                             
                             mean_price2 = round(mean(remove.repeats(price2), na.rm=TRUE),2))
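
As a possible alternative to the diff-based helper (a sketch, not the approach used in this answer), base R's rle() computes runs of equal consecutive values, and its values component holds one entry per run. The name remove_repeats_rle is hypothetical. Since it does not rely on subtraction, it should also accept character vectors; factors would need as.character() first.

```r
# Hypothetical helper: keep one value per run of consecutive equal values.
remove_repeats_rle <- function(x) {
  rle(x)$values
}

remove_repeats_rle(c(1, 1, 2, 1, 2, 2, 1))
##[1] 1 2 1 2 1

remove_repeats_rle(c("a", "a", "b", "a"))
##[1] "a" "b" "a"
```

It could be dropped into the summarize() call in place of remove.repeats.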

Applying the dplyr pipeline to some sample data that is hopefully similar to yours:

df <- structure(list(calc.method = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), 
price1 = c(1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 3), 
price2 = c(1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1)), 
.Names = c("calc.method", "price1", "price2"), row.names = c(NA, -15L), class = "data.frame")
##   calc.method price1 price2
##1            A      1      1
##2            A      1      1
##3            A      2      1
##4            A      1      1
##5            A      2      1
##6            A      2      1
##7            A      1      1
##8            B      1      2
##9            B      1      1
##10           B      2      2
##11           B      2      1
##12           B      2      2
##13           B      2      1
##14           B      1      2
##15           B      3      1

we get:

print(agg_data)
## # A tibble: 2 x 6
##  calc.method count count_non_rep_1 mean_price1 count_non_rep_2 mean_price2
##       <fctr> <int>           <int>       <dbl>           <int>       <dbl>
##1           A     7               5        1.40               1         1.0
##2           B     8               4        1.75               8         1.5
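
As a quick sanity check on the price1 columns of this table, the same numbers can be reproduced without dplyr. This is a minimal base R sketch that re-enters the price1 values from the sample data; the vector names counts and means are illustrative only.

```r
# price1 and calc.method copied from the sample data above.
price1 <- c(1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 3)
calc.method <- rep(c("A", "B"), times = c(7, 8))

remove.repeats <- function(x) {
  x[c(1, which(diff(x) != 0) + 1)]
}

# Number of values left per group after collapsing consecutive repeats.
counts <- tapply(price1, calc.method, function(p) length(remove.repeats(p)))
# Mean of those remaining values per group.
means <- tapply(price1, calc.method, function(p) mean(remove.repeats(p)))

counts
##A B 
##5 4 
means
##   A    B 
##1.40 1.75 
```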
