Question

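I have a data frame of the following form. It shows the volume of financial products each company issued per year, along with the percentage of that year's total issuance that each volume represents.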

  year           company       Volume     Volume Year          %
1 2013            AWK      347902000    21927606761     0.015865936
2 2013            DAR      177977000    21927606761     0.008116572
3 2013            DTC      615627000    21927606761     0.028075431
4 2013            GMT      538456000    21927606761     0.024556077
5 2013            CLW      407497000    21927606761     0.018583743
6 2013            AYI       31970000    21927606761     0.001457979

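For each year, I would like to select the largest issuers that together account for 70% of the total market volume.

I could do this manually, but I am looking for a formula that can easily be applied to a large dataset and that I can reuse often in the future!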

Answer 1

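You can first sort by year and by descending volume, then use ave with cumsum to build a cumulative share within each year, and finally select the rows whose cumulative share is at or below 70%, for example: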

tt  <- read.table(header=TRUE, text="year           company       Volume     VolumeYear          p
2013            AWK      347902000    21927606761     0.015865936
2013            DAR      177977000    21927606761     0.008116572
2013            DTC      615627000    21927606761     0.028075431
2013            GMT      538456000    21927606761     0.024556077
2013            CLW      407497000    21927606761     0.018583743
2013            AYI       31970000    21927606761     0.001457979")

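# Sort by year, then by descending volume
tt <- tt[with(tt, order(year, -Volume)),]
# Cumulative share within each year
tt$pc  <- with(tt, ave(p, year, FUN=cumsum))
# Keep the companies whose cumulative share stays at or below 70%
tt[tt$pc <= .7, c("year","company")]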
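One thing to decide is what happens to the company that pushes the running total past 70%. The filter above drops it, so the selected group can end up well short of 70%. Here is a minimal sketch of both options, using made-up shares (the `toy` data frame and its numbers are hypothetical, chosen only so that the threshold is actually crossed):

```r
# Hypothetical toy data: shares per company and year, each year summing to 1
toy <- data.frame(
  year    = c(2013, 2013, 2013, 2013, 2014, 2014, 2014),
  company = c("A", "B", "C", "D", "A", "B", "C"),
  p       = c(0.10, 0.45, 0.15, 0.30, 0.50, 0.15, 0.35),
  stringsAsFactors = FALSE
)

# Largest share first within each year
toy <- toy[with(toy, order(year, -p)), ]
# Running share per year
toy$pc <- with(toy, ave(p, year, FUN = cumsum))

# Strict: the running share never exceeds 70%
strict <- toy[toy$pc <= 0.7, c("year", "company")]

# Inclusive: also keep the company that crosses the line, i.e. every row
# whose running share *before* adding it was still below 70%
inclusive <- toy[toy$pc - toy$p < 0.7, c("year", "company")]
```

With this toy data, `strict` keeps only the single largest company in each year, while `inclusive` keeps the top two. Which one is right depends on whether "70% of the market" is meant as an upper bound or as a minimum to be covered.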

Answer 2

Using the dplyr library (and assuming your data.frame is called DF):

library(dplyr)

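trimmed_DF <- DF %>%
   mutate(percentage = Volume/VolumeYear) %>%    # you already have this column, though.
   arrange(year, desc(Volume)) %>%               # largest issuers first within each year
   group_by(year) %>%
   mutate(new_col = cumsum(percentage)) %>%      # running share of the year's total
   filter(new_col <= 0.70) %>%                   # keep issuers until the 70% limit is reached
   ungroup()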
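Since the question asks for something reusable, the same idea can be wrapped in a function. This is only a sketch: `top_issuers` and its `limit` argument are hypothetical names, and it assumes the data frame has `year`, `company` and `Volume` columns and contains every issuer of each year, so the yearly total can be recomputed as `sum(Volume)` rather than read from a `VolumeYear` column.

```r
library(dplyr)

# Hypothetical helper: per year, keep the largest issuers whose
# cumulative share of that year's volume stays at or below `limit`
top_issuers <- function(df, limit = 0.70) {
  df %>%
    group_by(year) %>%
    arrange(desc(Volume), .by_group = TRUE) %>%   # sort within each year
    mutate(share     = Volume / sum(Volume),      # share of the yearly total
           cum_share = cumsum(share)) %>%         # running share
    filter(cum_share <= limit) %>%
    ungroup()
}

# Made-up data for illustration only
toy <- data.frame(
  year    = c(2013, 2013, 2013, 2014, 2014, 2014),
  company = c("A", "B", "C", "A", "B", "C"),
  Volume  = c(20, 50, 30, 60, 10, 30),
  stringsAsFactors = FALSE
)

res <- top_issuers(toy)            # default 70% limit
res_wide <- top_issuers(toy, 0.95) # a looser limit keeps more issuers
```

Passing the limit as an argument means the same call works for any cutoff, and the function can be applied unchanged to a larger data frame with more years.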

Find the largest values whose cumulative sum reaches a limit