用聚合重写一个复杂的ddply

时间:2013-11-24 19:42:12

标签: r plyr

我想重新编写以下有些复杂的plyr命令,以便它更快并使用aggregate,tapply或data.table。

此功能允许您输入多个ID变量并测量变量,然后返回多个计算。但是,在更大的数据集上,它可能不是最有效的。


这是代码......

require(ggplot2) # to get the diamonds data set
require(plyr)

mean_sd_for_several_variables <- function(df, lvls, measures) {
  res <- ddply(df, lvls, function(x) {
                 ret <- vector()
                 for(measure in measures) {
                   mean_sd <- c(mean(x[,measure]), sd(x[,measure]))
                   names(mean_sd) <- c(paste0("mean_", measure), paste0("sd_", measure))
                   ret <- c(ret, mean_sd)
                 }
                 return(ret)
               }
  )

  print(res)
}

...返回:

mean_sd_for_several_variables(diamonds, c("color", "cut"), c("price","depth"))

   color       cut mean_price sd_price mean_depth sd_depth
1      D      Fair     4291.1   3286.1     64.048  3.29220
2      D      Good     3405.4   3175.1     62.366  2.22240
3      D Very Good     3470.5   3523.8     61.750  1.46223
4      D   Premium     3631.3   3711.6     61.169  1.15806
5      D     Ideal     2629.1   3001.1     61.678  0.71201
6      E      Fair     3682.3   2976.7     63.320  4.42103
7      E      Good     3423.6   3330.7     62.204  2.23059
8      E Very Good     3214.7   3408.0     61.730  1.42377
9      E   Premium     3538.9   3795.0     61.176  1.16454
10     E     Ideal     2597.6   2956.0     61.687  0.70718
11     F      Fair     3827.0   3223.3     63.508  3.70209
12     F      Good     3495.8   3202.4     62.202  2.23976
13     F Very Good     3778.8   3786.1     61.722  1.38939
14     F   Premium     4324.9   4012.0     61.260  1.16775
15     F     Ideal     3374.9   3766.6     61.676  0.69398
16     G      Fair     4239.3   3609.6     64.340  3.57340
17     G      Good     4123.5   3702.5     62.527  2.03893
18     G Very Good     3872.8   3861.4     61.841  1.33169
19     G   Premium     4500.7   4356.6     61.279  1.15341
20     G     Ideal     3720.7   4006.3     61.700  0.68714
21     H      Fair     5135.7   3886.5     64.585  3.14173
22     H      Good     4276.3   4020.7     62.500  2.09212
23     H Very Good     4535.4   4185.8     61.968  1.31895
24     H   Premium     5216.7   4466.2     61.322  1.15164
25     H     Ideal     3889.3   4013.4     61.733  0.72939
26     I      Fair     4685.4   3730.3     64.221  3.68771
27     I      Good     5078.5   4631.7     62.475  2.17958
28     I Very Good     5255.9   4687.1     61.935  1.32890
29     I   Premium     5946.2   5053.7     61.329  1.15338
30     I     Ideal     4452.0   4505.2     61.794  0.72334
31     J      Fair     4975.7   4050.5     64.357  3.31595
32     J      Good     4574.2   3707.8     62.396  2.12091
33     J Very Good     5103.5   4135.7     61.902  1.33679
34     J   Premium     6294.6   4788.9     61.390  1.13989
35     J     Ideal     4918.2   4476.2     61.822  0.94669

2 个答案:

答案 0 :(得分:4)

以下是data.table解决方案

mean_and_sd <- function(.SD){
  x1 = lapply(.SD, mean)
  x2 = lapply(.SD, sd)
  cbind(x1, x2) 
}

library(data.table)
DT = data.table(diamonds)
DT[, mean_and_sd(.SD), by = c("cut", "color"), .SDcols = c("price", "carat")]

您可以将其放入一个接受所需输入并返回相应数据框的函数中。

答案 1 :(得分:4)

使用aggregate

> result <- aggregate(cbind(price, depth) ~ color+cut,
                  FUN=function(x) c(mean=mean(x), sd=sd(x)),
                  data=diamonds)
> do.call(data.frame, result)
   color       cut price.mean price.sd depth.mean   depth.sd
1      D      Fair   4291.061 3286.114 64.0484663  3.2921972
2      E      Fair   3682.312 2976.652 63.3196429  4.4210329
3      F      Fair   3827.003 3223.303 63.5080128  3.7020938
4      G      Fair   4239.255 3609.644 64.3398089  3.5733985
5      H      Fair   5135.683 3886.482 64.5851485  3.1417311