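I have a data frame with sales at the ppg/product level, and I want to know how many products contribute to a specific percentage (the top 75%) of sales, as a test of the Pareto principle.
The data is: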
df= structure(list(Ppg = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1",
"p2"), class = "factor"), product = structure(c(1L, 2L, 3L, 4L,
1L, 2L, 3L), .Label = c("A", "B", "C", "D"), class = "factor"),
sales = c(50, 40, 30, 80, 100, 70, 30)), .Names = c("Ppg",
"product", "sales"), row.names = c(NA, -7L), class = "data.frame")
> df
  Ppg product sales
1  p1       A    50
2  p1       B    40
3  p1       C    30
4  p1       D    80
5  p2       A   100
6  p2       B    70
7  p2       C    30
I used dplyr to get the cumulative sum within each Ppg:
df %>% group_by(Ppg) %>% mutate(c1 = cumsum(sales))
     Ppg product sales    c1
  <fctr>  <fctr> <dbl> <dbl>
1     p1       A    50    50
2     p1       B    40    90
3     p1       C    30   120
4     p1       D    80   200
5     p2       A   100   100
6     p2       B    70   170
7     p2       C    30   200
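Is there a way to
i) calculate the proportion of sales (based on the cumsum), and
ii) find how many distinct products contribute to a given percentage of sales?
For example, for ppg p1, 2 distinct products (A & B combined account for 75% of sales).
So in the end, something like the following would be ideal: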
ppg Number_Products_towards_75%
p1 2
p2 1
Answer 0 (score: 2)
Assuming you want the answer for the order the products are currently in (reordering the rows would give a different result):
For 1, you can get the result with an extra mutate. Just divide the cumulative sum by the total of all sales in that group:
df %>%
  group_by(Ppg) %>%
  mutate(c1 = cumsum(sales)) %>%
  mutate(percent = c1 / sum(sales))
This gives you:
# A tibble: 7 x 5
# Groups: Ppg [2]
  Ppg    product sales    c1 percent
  <fctr> <fctr>  <dbl> <dbl>   <dbl>
1 p1     A        50.0  50.0   0.250
2 p1     B        40.0  90.0   0.450
3 p1     C        30.0 120     0.600
4 p1     D        80.0 200     1.00
5 p2     A       100   100     0.500
6 p2     B        70.0 170     0.850
7 p2     C        30.0 200     1.00
For 2, you can use mutate to add a column that flags whether each product is below the threshold, then summarize to count the products below it (and add one to the count, since one more product is what takes you past the threshold):
threshold <- 0.5
df %>%
  group_by(Ppg) %>%
  mutate(c1 = cumsum(sales)) %>%
  mutate(percent = c1 / sum(sales)) %>%
  mutate(isbelowthreshold = percent < threshold) %>% # add a column for if it's below the threshold
  summarize(count = sum(isbelowthreshold) + 1) # we need to add one since one extra product will put you over the threshold
This gives you:
# A tibble: 2 x 2
  Ppg    count
  <fctr> <dbl>
1 p1      3.00
2 p2      1.00
But again, this depends on the order of the products. Consider sorting from the highest to the lowest sales value within each group first.
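The sorting idea could be sketched roughly as below. This is only an illustration, not necessarily what the original answer had in mind: it assumes the 75% threshold from the question, rebuilds the example data with plain character columns instead of factors, and stores the output in a variable called result (a name chosen here for illustration).

```r
library(dplyr)

# Example data from the question, rebuilt with character columns (an assumption
# made here for brevity; the original uses factors)
df <- data.frame(
  Ppg     = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
  product = c("A", "B", "C", "D", "A", "B", "C"),
  sales   = c(50, 40, 30, 80, 100, 70, 30)
)

threshold <- 0.75  # assumed: the 75% share mentioned in the question

result <- df %>%
  arrange(Ppg, desc(sales)) %>%                    # largest sellers first within each Ppg
  group_by(Ppg) %>%
  mutate(percent = cumsum(sales) / sum(sales)) %>% # cumulative share of group sales
  summarise(count = sum(percent < threshold) + 1)  # products below the threshold, plus the one that crosses it

print(result)
```

With the rows sorted this way, the count is the smallest number of top-selling products needed to reach the threshold in each group, so it no longer depends on the row order of the input.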