R: calculating a function by group across many variables

Time: 2013-11-28 09:11:10

Tags: r plyr apply

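I have a large data set with many variables. I want to run some calculations on all of the variables, grouped by a factor, and get the results back in a tidy data frame. My data might look like this:

Example data: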

df <- data.frame( 
  hour    = factor(rep(1:24, each = 100)),
  price   = runif(20)*100,
  cons = sample(1:100,2400, replace = T),
  wind = sample(1:100,2400, replace = T),
  solar = sample(1:100,2400, replace = T)
)

I want to run some simple calculations on each variable, by factor, using a function like this:

fx <- function(x) {
  n <- length(x)
  mean <- mean(x)
  median <- median(x)
  std <- sd(x)
  var <- var(x)
  max <- max(x)
  min <- min(x)

#results <-list(n, mean, median, std, var, max, min)
#return(results)

}

It would be great to have the results in a data frame like this:

datasummary: 
hour(factor)   length(price)   mean(price)   ...   min(price)   length(cons)   ...   etc
1         
2
3
..
24

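This works fine if I do it manually for each variable, but I suspect there must be an easier way using plyr or some apply trick. However, I can't figure out how to go from a single variable to the whole data frame, or how to turn the result back into a data frame.

2 answers: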

Answer 0 (score: 2)

Use the base R function aggregate:

set.seed(1)  # your data, set.seed(1) is for reproducibility
df <- data.frame( 
  hour    = factor(rep(1:24, each = 100)),
  price   = runif(20)*100,
  cons = sample(1:100,2400, replace = T),
  wind = sample(1:100,2400, replace = T),
  solar = sample(1:100,2400, replace = T)
)

# a slightly modified version of your function
 fx <- function(x) {
  c(n=length(x), mean=mean(x), median=quantile(x, .5),
    std=sd(x), var=var(x), max=max(x), min=min(x))  
}

# applying your function and getting results
> agresult <- aggregate(.~hour, FUN=fx, data=df)
> agresult <- do.call(data.frame, agresult)
> agresult[1:6,1:8]


 hour price.n price.mean price.median.50. price.std price.var price.max price.min
1    1     100   55.51671         60.09837  28.02782  785.5584  99.19061  6.178627
2    2     100   55.51671         60.09837  28.02782  785.5584  99.19061  6.178627
3    3     100   55.51671         60.09837  28.02782  785.5584  99.19061  6.178627
4    4     100   55.51671         60.09837  28.02782  785.5584  99.19061  6.178627
5    5     100   55.51671         60.09837  28.02782  785.5584  99.19061  6.178627
6    6     100   55.51671         60.09837  28.02782  785.5584  99.19061  6.178627
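The do.call(data.frame, ...) step above matters because, when FUN returns a vector, aggregate stores the result for each variable as a single matrix column rather than as separate columns. The following is a minimal sketch of that behaviour, assuming the same simulated data and a shortened, three-statistic version of fx (the names raw and flat are only illustrative):

```r
set.seed(1)
df <- data.frame(
  hour  = factor(rep(1:24, each = 100)),
  price = runif(20) * 100,
  cons  = sample(1:100, 2400, replace = TRUE),
  wind  = sample(1:100, 2400, replace = TRUE),
  solar = sample(1:100, 2400, replace = TRUE)
)

# Shortened illustrative summary function: three named statistics
fx <- function(x) c(n = length(x), mean = mean(x), sd = sd(x))

# One row per hour; each variable is a 24 x 3 matrix column
raw <- aggregate(. ~ hour, data = df, FUN = fx)
ncol(raw)             # 5: hour plus one matrix column per variable
is.matrix(raw$price)  # TRUE

# data.frame() expands each matrix column into ordinary columns
flat <- do.call(data.frame, raw)
ncol(flat)            # 13: hour plus 4 variables x 3 statistics
names(flat)[1:4]
```

Without the flattening step, printing the result looks similar, but selecting a column such as price.mean by name would not work, because only price exists as a column.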

Answer 1 (score: 1)

Sure. Use numcolwise as the function argument to ddply:

require(plyr)
ddply(df, .(hour), numcolwise(mean))
#  hour   price  cons  wind solar
#1    1 58.0735 55.21 47.42 48.10
#2    2 58.0735 53.50 47.36 48.91
#3    3 58.0735 52.10 50.13 48.56
#4    4 58.0735 49.78 46.17 53.33
#5    5 58.0735 49.46 50.40 49.29
#6    6 58.0735 49.59 55.66 50.27
# ...

Or use reshape2::dcast.
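The answer names reshape2::dcast without showing a call. One possible way to use it, offered only as a sketch and not as the answerer's own code, is to melt the data to long form first and then cast it back with an aggregation function. The object names long and means are illustrative, and the reshape2 package is assumed to be installed:

```r
library(reshape2)

set.seed(1)
df <- data.frame(
  hour  = factor(rep(1:24, each = 100)),
  price = runif(20) * 100,
  cons  = sample(1:100, 2400, replace = TRUE),
  wind  = sample(1:100, 2400, replace = TRUE),
  solar = sample(1:100, 2400, replace = TRUE)
)

# Long form: one row per (hour, variable) observation,
# with columns hour, variable, value
long <- melt(df, id.vars = "hour")

# Cast back to one row per hour, averaging the values of each variable
means <- dcast(long, hour ~ variable, value.var = "value",
               fun.aggregate = mean)
head(means)
```

Here fun.aggregate has to return a single value per cell, so this form suits one statistic at a time; for several statistics at once, the aggregate approach in the other answer is more direct.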