R: How to get the mean of one variable by ranges of another variable?

Time: 2011-08-30 14:39:37

Tags: r dataframe aggregate

If I have a series of observations of two variables, x and y, how can I get the mean of y for each range of x?

For example, given data such as:

df = data.frame(x=runif(50,1,100),y=runif(50,300,700))

how can I get an answer like "when x is 1-10, the mean of y is 332.4; when x is 11-20, the mean of y is 632.3; and so on"?

6 answers:

Answer 0 (score: 6)

Use cut to bin x into intervals, then use ddply from the plyr package to compute the mean of y within each interval.

df$xrange <- cut(df$x, breaks=seq(0, 100, 10))

library(plyr)
ddply(df, .(xrange), summarize, mean_y=mean(y))
     xrange   mean_y
1    (0,10] 490.7571
2   (10,20] 462.6347
3   (20,30] 507.5614
4   (30,40] 482.6004
5   (40,50] 510.3081
6   (50,60] 480.7927
7   (60,70] 507.8944
8   (70,80] 458.4668
9   (80,90] 501.9672
10 (90,100] 493.4844
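
The interval labels produced by cut, such as (0,10], can be swapped for plainer range labels through its labels argument. The following is a minimal sketch using base R only; the names lower, upper and means, and the seed, are illustrative and not part of the answer above. Each bin is still open on the left and closed on the right, so the label "10-20" stands for 10 < x <= 20.

```r
# Sample data in the same shape as the question (seed chosen arbitrarily)
set.seed(1)
df <- data.frame(x = runif(50, 1, 100), y = runif(50, 300, 700))

# Bin edges and human-readable labels such as "0-10", "10-20", ...
lower <- seq(0, 90, 10)
upper <- lower + 10
df$xrange <- cut(df$x, breaks = seq(0, 100, 10),
                 labels = paste0(lower, "-", upper))

# Mean of y within each labelled range (base R, no plyr needed)
means <- aggregate(y ~ xrange, data = df, FUN = mean)
print(means)
```

Because x is continuous here, labels such as "1-10" and "11-20" would leave values like 10.5 unaccounted for, which is why this sketch labels each bin by its actual edges.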

Answer 1 (score: 4)

Use cut to form the groups and tapply to summarize y within them.

df$grp <- cut(df$x, seq(0, 100, 10))
with(df, tapply(y, grp, mean))

If you are a plyr fan, you may prefer

library(plyr)
ddply(df, .(grp), summarise, m = mean(y))

For completeness, the aggregate version is

aggregate(y ~ grp, df, mean)
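
One difference between these variants shows up when a bin contains no observations: tapply keeps the empty level and reports NA for it, whereas the formula interface of aggregate drops it. A small sketch with hand-picked data makes this visible (the data values and the names by_tapply and by_aggregate are made up for illustration):

```r
# Four observations chosen so that the (10,20] bin is empty
df <- data.frame(x = c(5, 7, 25, 28), y = c(10, 20, 30, 50))
df$grp <- cut(df$x, seq(0, 30, 10))

# tapply returns one entry per factor level, with NA for the empty bin
by_tapply <- with(df, tapply(y, grp, mean))
print(by_tapply)

# aggregate returns one row per bin that actually has data
by_aggregate <- aggregate(y ~ grp, df, mean)
print(by_aggregate)
```

Which behaviour is preferable depends on whether empty ranges should appear in the final table.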

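Answer 2 (score: 3)

One approach is to use cut() on the x variable to create a factor, specifying breaks every ten units. Given that factor, you can use by(), aggregate(), or similar functions to summarize the data frame, or rather its y column: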

R> set.seed(42); DF <- data.frame(x=runif(50,1,100), y=rnorm(50,30,70))
R> summary(DF)
       x               y         
 Min.   : 1.39   Min.   :-179.5  
 1st Qu.:40.66   1st Qu.: -19.4  
 Median :64.45   Median :  39.6  
 Mean   :60.29   Mean   :  25.9  
 3rd Qu.:90.10   3rd Qu.:  74.7  
 Max.   :98.90   Max.   : 140.3  
R> DF$cx <- cut(DF$x, breaks=seq(0,100,by=10))
R> ?by
R> by(DF, DF$cx, FUN=function(z) mean(z$y))
DF$cx: (0,10]
[1] 67.8747
--------------------------------------------- 
DF$cx: (10,20]
[1] 52.9104
--------------------------------------------- 
DF$cx: (20,30]
[1] -53.8961
--------------------------------------------- 
DF$cx: (30,40]
[1] 44.1992
--------------------------------------------- 
DF$cx: (40,50]
[1] 21.7404
--------------------------------------------- 
DF$cx: (50,60]
[1] 16.2122
--------------------------------------------- 
DF$cx: (60,70]
[1] -27.0338
--------------------------------------------- 
DF$cx: (70,80]
[1] 42.283
--------------------------------------------- 
DF$cx: (80,90]
[1] 40.8042
--------------------------------------------- 
DF$cx: (90,100]
[1] 38.8917
R> 

Or use ddply():

R> library(plyr)
R> ddply(DF, .(cx), function(z) mean(z$y))
         cx       V1
1    (0,10]  67.8747
2   (10,20]  52.9104
3   (20,30] -53.8961
4   (30,40]  44.1992
5   (40,50]  21.7404
6   (50,60]  16.2122
7   (60,70] -27.0338
8   (70,80]  42.2830
9   (80,90]  40.8042
10 (90,100]  38.8917
R> 

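Answer 3 (score: 3)

I think the way you framed the question is steering you toward overly narrow answers. You should consider regression methods for summarizing the joint relationship between continuous variables. Plotting a scatterplot with a fitted regression spline will do less violence to the underlying relationship than the piecewise analysis you specified.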
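
The answer does not include code, so the following is only one possible reading of the suggestion: a natural cubic spline fitted with lm and ns from the splines package that ships with R. The choice of df = 4 for the spline, the seed, and the names fit and grid are assumptions made for illustration.

```r
library(splines)  # part of the standard R distribution

# Sample data in the same shape as the question (seed chosen arbitrarily)
set.seed(1)
df <- data.frame(x = runif(50, 1, 100), y = runif(50, 300, 700))

# Regress y on a natural cubic spline basis of x
# (4 degrees of freedom is an arbitrary illustrative choice)
fit <- lm(y ~ ns(x, df = 4), data = df)

# Evaluate the fitted curve on an evenly spaced grid within the observed x range
grid <- data.frame(x = seq(min(df$x), max(df$x), length.out = 200))
grid$fit <- predict(fit, newdata = grid)

# Scatterplot of the raw data with the fitted spline overlaid
plot(df$x, df$y, xlab = "x", ylab = "y")
lines(grid$x, grid$fit, lwd = 2)
```

With the uniform random data from the question there is no real relationship to find, so the curve should stay roughly flat; on real data the same plot shows how the mean of y changes continuously with x rather than jumping at arbitrary bin edges.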

Answer 4 (score: 3)

Here is a data.table solution:

library(data.table)
# Group by the binned x and compute the mean of y in each bin
as.data.table(df)[, .(mean_y = mean(y)), by = .(xrange = cut(x, seq(0, 100, 10)))]

Answer 5 (score: 2)

You can use tapply together with pretty, which generates the break points for cut:

 tapply(df$y,cut(df$x,pretty(range(df$x),high.u.bias=0.1)),mean)
  (0,10]  (10,20]  (20,30]  (30,40]  (40,50]  (50,60]  (60,70]  (70,80] 
496.9840 510.4164 502.4092 492.5806 493.3364 549.5207 507.4511 472.3391 
 (80,90] (90,100] 
479.8795 482.6728 

aggregate can also be used:

aggregate(df$y,list(cut(df$x,pretty(range(df$x),high.u.bias=0.1))),FUN=mean)
    Group.1        x
1    (0,10] 496.9840
2   (10,20] 510.4164
3   (20,30] 502.4092
4   (30,40] 492.5806
5   (40,50] 493.3364
6   (50,60] 549.5207
7   (60,70] 507.4511
8   (70,80] 472.3391
9   (80,90] 479.8795
10 (90,100] 482.6728