Question

Given this simplified data:

set.seed(13)

user_id = rep(1:2, each = 10)
order_id = sample(1:20, replace = FALSE)
cost = round(runif(20, 1.5, 75),1)
category = sample( c("apples", "pears", "chicken"), 20, replace = TRUE)
pit = rep(c(0,0,0,0,1), 4)

df = data.frame(cbind(user_id, order_id, cost, category, pit))

user_id order_id cost category pit
      1       15 11.6    pears   0
      1        5 41.7   apples   0
      1        8 51.3  chicken   0
      1        2 40.3    pears   0
      1       16  7.9    pears   1
      1        1 47.1  chicken   0
      1        9  3.8   apples   0
      1       10 35.4   apples   0
      1       11 25.8  chicken   0
      1       20 48.1  chicken   1
      2        7 32.6    pears   0
      2       18 31.3    pears   0
      2       14   69   apples   0
      2        4 60.9  chicken   0
      2       13 41.2   apples   1
      2       17  9.4    pears   0
      2       19 34.9   apples   0
      2        6  5.3    pears   0
      2        3 57.3   apples   0
      2       12  7.7   apples   1

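I want to create columns holding the cumulative cost and the number of distinct categories since the last pit == 1. So the result would look like this: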

user_id order_id cost category pit cum_cost distinct_categories
      1       15 11.6    pears   0     11.6                   1
      1        5 41.7   apples   0     53.3                   2
      1        8 51.3  chicken   0    104.6                   3
      1        2 40.3    pears   0    144.9                   3
      1       16  7.9    pears   1    152.8                   3
      1        1 47.1  chicken   0     47.1                   1
      1        9  3.8   apples   0     50.9                   2
      1       10 35.4   apples   0     86.3                   2
      1       11 25.8  chicken   0    112.1                   3
      1       20 48.1  chicken   1    160.2                   3
      2        7 32.6    pears   0     32.6                   1
      2       18 31.3    pears   0     63.9                   1
      2       14   69   apples   0    132.9                   2
      2        4 60.9  chicken   0    193.8                   3
      2       13 41.2   apples   1    235.0                   3
      2       17  9.4    pears   0      9.4                   1
      2       19 34.9   apples   0     44.3                   2
      2        6  5.3    pears   0     49.6                   2
      2        3 57.3   apples   0    106.9                   2
      2       12  7.7   apples   1    114.6                   2

Ideally the solution would be in dplyr, but I'm open to other packages/approaches. Thank you very much for your help! Kasia

Answer 1

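We can use dplyr. Group by 'user_id' and by a grouping variable created by taking the cumulative sum of 'pit' and then its lag. Within each group, the cumsum of 'cost' gives 'cum_cost', and the cummax of the match index between 'category' and the unique values of 'category' gives 'distinct_categories'.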
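The steps described above can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code of the original answer. It assumes dplyr is installed and that 'cost' and 'pit' are numeric; the `cbind()` call in the question coerces every column to character, so the data frame is rebuilt here by typing in the values from the question's table. The helper column `grp` and the result name `res` are names chosen for illustration.

```r
library(dplyr)

# Values typed in from the table in the question, so the result does not
# depend on the random number generator or on cbind() coercion.
df <- data.frame(
  user_id  = rep(1:2, each = 10),
  order_id = c(15, 5, 8, 2, 16, 1, 9, 10, 11, 20,
               7, 18, 14, 4, 13, 17, 19, 6, 3, 12),
  cost     = c(11.6, 41.7, 51.3, 40.3, 7.9, 47.1, 3.8, 35.4, 25.8, 48.1,
               32.6, 31.3, 69, 60.9, 41.2, 9.4, 34.9, 5.3, 57.3, 7.7),
  category = c("pears", "apples", "chicken", "pears", "pears",
               "chicken", "apples", "apples", "chicken", "chicken",
               "pears", "pears", "apples", "chicken", "apples",
               "pears", "apples", "pears", "apples", "apples"),
  pit      = rep(c(0, 0, 0, 0, 1), 4),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(user_id) %>%
  # grp (hypothetical helper) counts how many pit == 1 rows came before
  # the current row, so a pit == 1 row stays in the block it closes.
  mutate(grp = lag(cumsum(pit), default = 0)) %>%
  group_by(user_id, grp) %>%
  mutate(
    cum_cost = cumsum(cost),
    # match() gives the index of each category's first appearance within
    # the block; its running maximum is the distinct count so far.
    distinct_categories = cummax(match(category, unique(category)))
  ) %>%
  ungroup() %>%
  select(-grp)

print(as.data.frame(res))
```

The lag is what makes the reset happen on the row after pit == 1 rather than on the pit row itself. One thing to be aware of: for orders 11 and 20 the categories listed in the question for that block are chicken, apples, apples, chicken, chicken, so this sketch returns 2 distinct categories there, whereas the expected table in the question shows 3.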

R: Calculate cumulative sums and counts since the last occurrence of a value
