I have a dataset with the following columns: flavor, flavorid and unitsold.
flavor flavorid unitsold
beans 350 6
creamy 460 2
.
.
.
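I want to find the top 10 flavors and then calculate the market share of each one. My logic is: market share of a flavor = units sold for that flavor divided by the total units sold.
How can I implement this? For the output I only want two columns: flavorid and the corresponding market share. Do I need to save the top ten flavors in some table first?
Answer 0 (score: 4)
One way is to use the dplyr package:
Example dataset: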
flavor <- rep(letters[1:15],each=5)
flavorid <- rep(1:15,each=5)
unitsold <- 1:75
df <- data.frame(flavor,flavorid,unitsold)
> df
flavor flavorid unitsold
1 a 1 1
2 a 1 2
3 a 1 3
4 a 1 4
5 a 1 5
6 b 2 6
7 b 2 7
8 b 2 8
9 b 2 9
...
...
Solution:
library(dplyr)
df %>%
  select(flavorid, unitsold) %>%               # select the columns you want
  group_by(flavorid) %>%                       # group by flavorid
  summarise(total = sum(unitsold)) %>%         # sum the total units sold per id
  mutate(marketshare = total / sum(total)) %>% # calculate the market share per id
  arrange(desc(marketshare)) %>%               # order by market share, descending
  head(10)                                     # pick the first 10
# You can add another select(flavorid, marketshare) if you only want those two columns.
Output:
Source: local data frame [10 x 3]
flavorid total marketshare
1 15 365 0.12807018
2 14 340 0.11929825
3 13 315 0.11052632
4 12 290 0.10175439
5 11 265 0.09298246
6 10 240 0.08421053
7 9 215 0.07543860
8 8 190 0.06666667
9 7 165 0.05789474
10 6 140 0.04912281
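
To get only the two columns asked for in the question, the pipeline above can end with a select() call. The sketch below rebuilds the same example data and stores the result in a variable; the names top10 and top10_only are made up for illustration. It also shows a second, optional variant in which the share is taken relative to the top ten flavors only. That second reading is an assumption about what "market share" could mean here, not something the question states.

```r
library(dplyr)

# Same example data as in the answer above
df <- data.frame(
  flavor   = rep(letters[1:15], each = 5),
  flavorid = rep(1:15, each = 5),
  unitsold = 1:75
)

# Share of ALL units sold, top ten flavors, two columns only
top10 <- df %>%
  group_by(flavorid) %>%
  summarise(total = sum(unitsold)) %>%
  mutate(marketshare = total / sum(total)) %>%
  arrange(desc(marketshare)) %>%
  head(10) %>%
  select(flavorid, marketshare)

# Assumption: share measured against the top ten flavors only,
# so these ten shares add up to 1
top10_only <- df %>%
  group_by(flavorid) %>%
  summarise(total = sum(unitsold)) %>%
  arrange(desc(total)) %>%
  head(10) %>%
  mutate(marketshare = total / sum(total)) %>%
  select(flavorid, marketshare)
```

There is no need to save the top ten flavors in a separate table first. The only thing that matters is whether the share is computed before head(10) (share of all sales) or after it (share among the top ten).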