In R

Time: 2018-06-15 15:27:31

Tags: r dplyr plyr lapply

I have a dataset:

df=structure(list(SKU = c(11202L, 11202L, 11202L, 11202L, 11202L, 
11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 
11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L
), stuff = c(8.85947691, 9.450108704, 10.0407405, 10.0407405, 
10.63137229, 11.22200409, 11.22200409, 11.81263588, 12.40326767, 
12.40326767, 12.40326767, 12.99389947, 13.58453126, 14.17516306, 
14.76579485, 15.94705844, 17.12832203, 17.71895382, 21.26274458, 
25.98779894, 63.19760196), action = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), 
    acnumber = c(137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 
    137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 
    137L, 137L, 137L), year = c(2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L)), .Names = c("SKU", 
"stuff", "action", "acnumber", "year"), class = "data.frame", row.names = c(NA, 
-21L))

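The action column takes only two values, 0 and 1. Category 1 has 3 observations of stuff and category 0 has 18.

I need the median of the stuff variable for category 1 only, ignoring the zeros (it equals 25.98779894). There are zeros between the ones; any such zeros must be dropped wherever they occur. That is, as if the dataset looked like this: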

structure(list(SKU = c(11202L, 11202L, 11202L, 11202L, 11202L, 
11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 
11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L, 11202L
), stuff = c(8.85947691, 9.450108704, 10.0407405, 10.0407405, 
10.63137229, 11.22200409, 11.22200409, 11.81263588, 12.40326767, 
12.40326767, 12.40326767, 12.99389947, 13.58453126, 14.17516306, 
14.76579485, 15.94705844, 17.12832203, 17.71895382, 21.26274458, 
25.98779894, 63.19760196), action = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L), 
    acnumber = c(137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 
    137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 137L, 
    137L, 137L, 137L), year = c(2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
    2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L)), .Names = c("SKU", 
"stuff", "action", "acnumber", "year"), class = "data.frame", row.names = c(NA, 
-21L))

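In addition, I need the median of stuff over the last three category 0 observations that precede the first 1; in our example it is 12.40326767.

Then subtract the category 0 median from the category 1 median and multiply by the number of ones, which is 3 in this example.

(25.98779894 - 12.40326767) * 3 = 40.75359381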
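The arithmetic above can be checked with a short base R sketch. The vectors are copied from the example data; the helper names (`n_ones`, `first_one`, `med_ones`, `med_zeros`, `value`) are made up for illustration, and the sketch assumes at least three rows precede the first 1.

```r
# Values copied from the example data frame in the question
stuff <- c(8.85947691, 9.450108704, 10.0407405, 10.0407405,
           10.63137229, 11.22200409, 11.22200409, 11.81263588, 12.40326767,
           12.40326767, 12.40326767, 12.99389947, 13.58453126, 14.17516306,
           14.76579485, 15.94705844, 17.12832203, 17.71895382, 21.26274458,
           25.98779894, 63.19760196)
action <- c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L,
            0L, 0L, 0L, 0L, 1L, 1L)

n_ones    <- sum(action == 1)             # number of rows with action == 1
first_one <- match(1, action)             # position of the first 1
med_ones  <- median(stuff[action == 1])   # median of stuff over the ones only
# median of stuff over the three rows just before the first 1
med_zeros <- median(stuff[(first_one - 3):(first_one - 1)])
value     <- n_ones * (med_ones - med_zeros)

print(value)
```

Here the count of ones is computed from the data rather than typed in, which is the quantity the rest of the question is about.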

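This solution

library(dplyr)

df %>%
  group_by(SKU,acnumber,year) %>%
  # 3 * (median of stuff over the ones - median of the three rows before the first 1)
  summarize(value = 3*(median(stuff[action==1]) - median(stuff[match(1,action)-3:1])),
            stuff=first(stuff),
            action = sum(action)) %>%
  select(SKU,stuff,action,acnumber,year,value)

was provided to me by Moody_Mudskipper.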
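The index expression `match(1, action) - 3:1` may look cryptic. In R the `:` operator binds tighter than binary `-`, so it reads as "position of the first 1, minus c(3, 2, 1)". A minimal sketch on a hypothetical toy vector:

```r
action <- c(0L, 0L, 0L, 0L, 1L, 0L, 1L)  # hypothetical toy vector

first_one <- match(1, action)  # position of the first 1 (here 5)
idx <- first_one - 3:1         # parsed as first_one - c(3, 2, 1)
print(idx)                     # 2 3 4
```

If the first 1 falls within the first three rows of a group, the expression produces zero or negative indices, which R does not interpret as "the rows before", so such groups would need separate handling.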

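But! In this example the number of ones in action is 3, so we multiply by 3, but the number of ones can be greater or less than 3. How do I multiply by the actual count? For example, if a group has two ones, then

summarize(value = 2*(median(stuff[action==1]) - median(stuff[match(1,action)-3:1])),

and I do not want to enter that number by hand every time.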

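The solution sum(df$action == 1) does not work

summarize(value = sum(df$action == 1)*(median(stuff[action==1]) - median(stuff[match(1,action)-3:1])),

because it sums over the entire dataset, so the multiplication is wrong. The total count is 692, and that number is used as the multiplier:

 summarize(value = 692*(median(stuff[action==1]) - median(stuff[match(1,action)-3:1])),

This is wrong. The multiplier must be the number of ones within each specific group of SKU, acnumber, year: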

111-23-2018 is the first group and has 3 ones
112-24-2018 is the second group and has 2 ones

and so on.

How do I do this correctly?

1 answer:

Answer 0: (score: 1)

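library(dplyr)

df%>%
   group_by(SKU,acnumber,year)%>%
   # s: number of ones in the group; k: row of the first 1 within the group
   summarise(s=sum(action),k=which(action==1)[1],
            # (k-3):(k-1) is the three rows before the first 1, as the question asks
            l=s*(median(stuff[action==1])-median(stuff[(k-3):(k-1)])))%>%
   data.frame()

Output:

    SKU acnumber year s  k        l
1 11202      137 2018 3 11 40.75359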
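To see why a bare `action` inside `summarise()` gives a per-group multiplier while `df$action` does not, here is a minimal sketch on made-up data. The data frame `toy`, its column `grp`, and the result columns `n_group` and `n_global` are hypothetical names, not part of the original data. The sketch assumes every group has at least three rows before its first 1.

```r
library(dplyr)

# Hypothetical toy data: group "g1" has 3 ones, group "g2" has 2 ones
toy <- data.frame(
  grp    = rep(c("g1", "g2"), each = 6),
  stuff  = c(1, 2, 3, 10, 20, 30,   2, 4, 6, 8, 50, 70),
  action = c(0L, 0L, 0L, 1L, 1L, 1L,   0L, 0L, 0L, 0L, 1L, 1L)
)

res <- toy %>%
  group_by(grp) %>%
  summarise(
    n_group  = sum(action == 1),      # ones in the current group only
    n_global = sum(toy$action == 1),  # ones in the whole data frame, same for every group
    value    = n_group * (median(stuff[action == 1]) -
                          median(stuff[match(1, action) - 3:1]))
  ) %>%
  data.frame()

# g1: 3 * (20 - 2) = 54; g2: 2 * (60 - 6) = 108; n_global is 5 in both rows
print(res)
```

Inside `summarise()` a bare column name refers to the rows of the current group, whereas `toy$action` always pulls the full column from the original data frame, which is what produced the single dataset-wide multiplier described in the question.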