Aggregation and percentage calculation by group

Asked: 2014-10-28 20:25:30

Tags: r plyr aggregation

I have a dataset in R of weekly allowances for students by class, similar to:

Year    ID  Class       Allowance
2013    123 Freshman    100
2013    234 Freshman    110
2013    345 Sophomore   150
2013    456 Sophomore   200
2013    567 Junior      250
2014    678 Junior      100
2014    789 Junior      230
2014    890 Freshman    110
2014    891 Freshman    250
2014    892 Sophomore   220

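How can I aggregate the results by group (Year/Class) to get both the sum and the percentage within each year? Getting the sum with ddply seems easy, but I can't get the percentage-by-group part right.

This works for the sum: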

summary <- ddply(my_data, .(Year, Class), summarize, Sum_Allow=sum(Allowance))

But it doesn't work for the percentage within each group:

summary <- ddply(my_data, .(Year, Class), summarize, Sum_Allow=sum(Allowance),
                 Allow_Pct=Allowance/sum(Allowance))

The desired result should look like this:

 Year     Class Sum_Allow Allow_Pct
 2013  Freshman       210       26%
 2013    Junior       250       31%
 2013 Sophomore       350       43%
 2014  Freshman       360       40%
 2014    Junior       330       36%
 2014 Sophomore       220       24%

I tried ddply from the plyr package, but I'm open to any approach that works.

3 Answers:

Answer 0: (score: 7)

Here is a possible solution using the data.table package (assuming your data is called df):

library(data.table)
setDT(df)[, list(Sum_Allow = sum(Allowance)), keyby = list(Year, Class)][, 
            Allow_Pct := paste0(round(Sum_Allow/sum(Sum_Allow), 2)*100, "%"), by = Year][]

#    Year     Class Sum_Allow Allow_Pct
# 1: 2013  Freshman       210       26%
# 2: 2013    Junior       250       31%
# 3: 2013 Sophomore       350       43%
# 4: 2014  Freshman       360       40%
# 5: 2014    Junior       330       36%
# 6: 2014 Sophomore       220       24%

Credit to @rawr, here is a possible base R solution:

df2 <- aggregate(Allowance ~  Class + Year, df, sum)
transform(df2, Allow_pct = ave(Allowance, Year, FUN = function(x) paste0(round(x/sum(x), 2)*100, "%")))
#       Class Year Allowance Allow_pct
# 1  Freshman 2013       210       26%
# 2    Junior 2013       250       31%
# 3 Sophomore 2013       350       43%
# 4  Freshman 2014       360       40%
# 5    Junior 2014       330       36%
# 6 Sophomore 2014       220       24%
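
As a cross-check on the figures above, the same within-year shares can be sketched with base R's xtabs and prop.table. This is only an illustrative alternative, not part of the original answers; the names tab and shares are made up for the example, and the data frame is rebuilt from the sample in the question.

```r
# Rebuild the sample data from the question
df <- read.table(header = TRUE, text = "Year ID Class Allowance
2013 123 Freshman  100
2013 234 Freshman  110
2013 345 Sophomore 150
2013 456 Sophomore 200
2013 567 Junior    250
2014 678 Junior    100
2014 789 Junior    230
2014 890 Freshman  110
2014 891 Freshman  250
2014 892 Sophomore 220")

# Sum Allowance into a Year x Class contingency table
tab <- xtabs(Allowance ~ Year + Class, data = df)

# margin = 1 divides each cell by its row (Year) total
shares <- prop.table(tab, margin = 1)

# Whole-number percentages, one row per year
round(100 * shares)
```

This returns a wide table (one row per year, one column per class) rather than the long layout shown in the question, so it is mainly useful for checking the numbers.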

Answer 1: (score: 4)

You can do this in two steps:

my_data <- read.table(header = TRUE,
                      text = "Year    ID  Class       Allowance
2013    123 Freshman    100
2013    234 Freshman    110
2013    345 Sophomore   150
2013    456 Sophomore   200
2013    567 Junior      250
2014    678 Junior      100
2014    789 Junior      230
2014    890 Freshman    110
2014    891 Freshman    250
2014    892 Sophomore   220")

library(plyr)
(summ <- ddply(my_data, .(Year, Class), summarize, Sum_Allow=sum(Allowance)))

#   Year     Class Sum_Allow
# 1 2013  Freshman       210
# 2 2013    Junior       250
# 3 2013 Sophomore       350
# 4 2014  Freshman       360
# 5 2014    Junior       330
# 6 2014 Sophomore       220

ddply(summ, .(Year), mutate, Allow_pct = Sum_Allow / sum(Sum_Allow) * 100)

#   Year     Class Sum_Allow Allow_pct
# 1 2013  Freshman       210  25.92593
# 2 2013    Junior       250  30.86420
# 3 2013 Sophomore       350  43.20988
# 4 2014  Freshman       360  39.56044
# 5 2014    Junior       330  36.26374
# 6 2014 Sophomore       220  24.17582
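
If the "26%" style labels from the question are wanted, the numeric column can be formatted afterwards. A minimal sketch using sprintf, with the values copied from the output above; the names allow_pct and allow_lbl are hypothetical and exist only for this illustration.

```r
# Numeric percentages as printed by the second ddply call
allow_pct <- c(25.92593, 30.86420, 43.20988, 39.56044, 36.26374, 24.17582)

# "%.0f" rounds to a whole number; "%%" emits a literal percent sign
allow_lbl <- sprintf("%.0f%%", allow_pct)
allow_lbl
# "26%" "31%" "43%" "40%" "36%" "24%"
```

Note that the result is a character vector, so it is best applied as a final display step after any further arithmetic is done.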

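I don't know if this happens to the rest of you, but when I run the original attempt, R crashes instead of throwing a warning. It also crashes if I misspell Allowance. I really hate that; hadley, please fix.

Base R forever.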

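Answer 2: (score: 3)

So, assuming what you want is to:

  1. take the sum of the Allowance column for each group defined by Year and Class, and
  2. divide that sum by the total for the corresponding year,

then this can be done in dplyr: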

    library(dplyr)
    my_data <- read.table(header = TRUE,
                          text = 
    'Year    ID  Class       Allowance
    2013    123 Freshman    100
    2013    234 Freshman    110
    2013    345 Sophomore   150
    2013    456 Sophomore   200
    2013    567 Junior      250
    2014    678 Junior      100
    2014    789 Junior      230
    2014    890 Freshman    110
    2014    891 Freshman    250
    2014    892 Sophomore   220')
    
    summary <- my_data %>%
      group_by(Year) %>%
      summarise(Year_Sum_Allow = sum(Allowance)) %>%
      left_join(x = my_data, y = ., by = 'Year') %>%
      group_by(Year, Class) %>%
      summarise(Sum_Allow = sum(Allowance),
                Allow_Pct = Sum_Allow/first(Year_Sum_Allow))
    
    summary
    
    # Results
    Source: local data frame [6 x 4]
    Groups: Year
    
      Year     Class Sum_Allow Allow_Pct
    1 2013  Freshman       210 0.2592593
    2 2013    Junior       250 0.3086420
    3 2013 Sophomore       350 0.4320988
    4 2014  Freshman       360 0.3956044
    5 2014    Junior       330 0.3626374
    6 2014 Sophomore       220 0.2417582
    

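If you're not familiar with dplyr, the syntax may look strange. I recommend reading the introduction; it saves a lot of time.

Edit: I should add that if you want the nice percentage format from the example output, you can replace the last line with Allow_Pct = paste0(round(100 * Sum_Allow/first(Year_Sum_Allow)), '%')

Edit 2: As jbaums pointed out, this can be simplified to: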

    my_data %>% 
      group_by(Year, Class) %>% 
      summarise(sum_allow=sum(Allowance)) %>% 
      mutate(pct_allow=sum_allow/sum(sum_allow))
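
The simplified version works because summarise drops the last grouping variable (Class), so the following mutate still runs within each Year. The sketch below makes that explicit with the .groups argument, assuming dplyr 1.0.0 or later, where that argument is available. The name result and the trailing ungroup() are additions for illustration, not part of the original answer.

```r
library(dplyr)

my_data <- read.table(header = TRUE, text = "Year ID Class Allowance
2013 123 Freshman  100
2013 234 Freshman  110
2013 345 Sophomore 150
2013 456 Sophomore 200
2013 567 Junior    250
2014 678 Junior    100
2014 789 Junior    230
2014 890 Freshman  110
2014 891 Freshman  250
2014 892 Sophomore 220")

result <- my_data %>%
  group_by(Year, Class) %>%
  # "drop_last" peels off Class, leaving the result grouped by Year only
  summarise(sum_allow = sum(Allowance), .groups = "drop_last") %>%
  # sum(sum_allow) is therefore the yearly total, not the grand total
  mutate(pct_allow = sum_allow / sum(sum_allow)) %>%
  # remove the remaining Year grouping so later steps are not affected
  ungroup()

result
```

Stating the grouping behaviour explicitly also avoids the informational message that newer dplyr versions print when summarise is called on data grouped by more than one variable.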