I have a dataset:
df=structure(list(SKU = c(11202L, 11202L, 11202L, 11202L, 11202L,
11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L,
11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L
), stuff = c(8.85947691, 9.450108704, 10.0407405, 10.0407405,
10.63137229, 11.22200409, 11.22200409, 11.81263588, 12.40326767,
12.40326767, 12.40326767, 12.99389947, 13.58453126, 14.17516306,
14.76579485, 15.94705844, 17.12832203, 17.71895382, 21.26274458,
25.98779894, 63.19760196), action = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L),
acnumber = c(137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L,
137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L,
137L, 137L, 137L), year = c(2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L)), .Names = c("SKU",
"stuff", "action", "acnumber", "year"), class = "data.frame", row.names = c(NA,
-21L))
The action column takes only two values, 0 and 1. Here there are 3 observations with action 1 and 18 observations with action 0.
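I need to compute the median of the stuff variable over the action 1 rows only, ignoring the zeros (here it equals 25.98779894).
As you can see, there are zeros between the ones; whenever such zeros exist, they must be dropped.
In other words, treat the dataset as if it looked like this: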
structure(list(SKU = c(11202L, 11202L, 11202L, 11202L, 11202L,
11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L,
11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L
), stuff = c(8.85947691, 9.450108704, 10.0407405, 10.0407405,
10.63137229, 11.22200409, 11.22200409, 11.81263588, 12.40326767,
12.40326767, 12.40326767, 12.99389947, 13.58453126, 14.17516306,
14.76579485, 15.94705844, 17.12832203, 17.71895382, 21.26274458,
25.98779894, 63.19760196), action = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L),
acnumber = c(137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L,
137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L,
137L, 137L, 137L), year = c(2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L)), .Names = c("SKU",
"stuff", "action", "acnumber", "year"), class = "data.frame", row.names = c(NA,
-21L))
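A minimal base R sketch of that masking step, for illustration only. It rebuilds just the action column from the data above; the names `ones`, `between`, and `action_masked` are assumptions introduced here, not part of the question.

```r
# The action column from the question: ten 0s, a 1, eight 0s, then two 1s
action <- c(rep(0L, 10), 1L, rep(0L, 8), 1L, 1L)

ones <- which(action == 1L)   # positions of the 1s
idx  <- seq_along(action)

# Zeros lying strictly between the first and the last 1
between <- idx > min(ones) & idx < max(ones) & action == 0L

# Blank them out as NA, leaving everything else untouched
action_masked <- replace(action, between, NA_integer_)
action_masked
```

Zeros before the first 1 are kept, since they are needed for the category 0 median.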
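I also need the median of stuff over the last three action 0 observations that come before the first 1; in this example it is 12.40326767.
Then subtract the category 0 median from the category 1 median and multiply by the number of ones, which is 3 in this example:
(25.98779894 - 12.40326767) * 3 = 40.75359381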
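The arithmetic above can be checked step by step in base R. This is only a sketch under the assumption that at least three rows precede the first 1 (true for this data); the variable names `med1`, `med0`, and `first_one` are made up for the illustration.

```r
# stuff and action columns copied from the dataset in the question
stuff <- c(8.85947691, 9.450108704, 10.0407405, 10.0407405, 10.63137229,
           11.22200409, 11.22200409, 11.81263588, 12.40326767, 12.40326767,
           12.40326767, 12.99389947, 13.58453126, 14.17516306, 14.76579485,
           15.94705844, 17.12832203, 17.71895382, 21.26274458, 25.98779894,
           63.19760196)
action <- c(rep(0L, 10), 1L, rep(0L, 8), 1L, 1L)

# Median of stuff over the action == 1 rows
med1 <- median(stuff[action == 1L])

# Median of stuff over the three rows just before the first 1
first_one <- match(1L, action)
med0 <- median(stuff[(first_one - 3):(first_one - 1)])

# Difference of medians times the number of 1s
value <- (med1 - med0) * sum(action == 1L)
value
```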
How can I do this?
The output I expect is:
SKU   stuff      action acnumber year value
11202 8.85947691 3      137      2018 40.75359381
Answer 0 (score: 2)
Here is a tidyverse solution:
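library(dplyr)

df %>%
  group_by(SKU, acnumber, year) %>%
  summarize(
    # (median of stuff where action == 1) minus (median of the three rows
    # just before the first 1), times the number of 1s in the group
    value = sum(action) * (median(stuff[action == 1]) -
                           median(stuff[match(1, action) - 3:1])),
    stuff = first(stuff),
    action = sum(action)) %>%
  select(SKU, stuff, action, acnumber, year, value)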
# # A tibble: 1 x 6
# # Groups: SKU, acnumber [1]
# SKU stuff action acnumber year value
# <int> <dbl> <int> <int> <int> <dbl>
# 1 11202 8.86 3 137 2018 40.8
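The `match(1, action) - 3:1` indexing relies on at least three rows coming before the first 1. As a hedged sketch, a small helper can guard that case; `action_value` and its `n_before` argument are hypothetical names introduced here, not part of dplyr or of the answer above, and the toy vectors are made up for the illustration.

```r
# Hypothetical helper: difference of medians times the number of 1s,
# using up to n_before rows that precede the first 1.
action_value <- function(stuff, action, n_before = 3) {
  first_one <- match(1L, action)   # index of the first 1, NA if there is none
  if (is.na(first_one) || first_one == 1L) {
    return(NA_real_)               # no 1s, or nothing before the first 1
  }
  before <- tail(stuff[seq_len(first_one - 1L)], n_before)
  (median(stuff[action == 1]) - median(before)) * sum(action == 1)
}

# Toy data: 1s at positions 5 and 7, with a 0 in between
toy_stuff  <- c(1, 2, 3, 4, 10, 5, 20)
toy_action <- c(0, 0, 0, 0, 1, 0, 1)
action_value(toy_stuff, toy_action)   # (15 - 3) * 2

# Only two rows precede the first 1: both are used
action_value(c(1, 3, 10), c(0, 0, 1))
```

Inside the pipeline above, such a helper could be called as `value = action_value(stuff, action)` within `summarize()`, assuming action contains no missing values.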