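If I have a series of observations of two variables, x and y, how can I get the mean of y for ranges of x?
For example, given some data such as:
df = data.frame(x=runif(50,1,100),y=runif(50,300,700))
how can I get an answer like "when x is 1-10, the mean of y is 332.4; when x is 11-20, the mean of y is 632.3; and so on"?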
Answer 0 (score: 6)
Use cut to bin x, then use ddply from the plyr package:
df$xrange <- cut(df$x, breaks=seq(0, 100, 10))
library(plyr)
ddply(df, .(xrange), summarize, mean_y=mean(y))
xrange mean_y
1 (0,10] 490.7571
2 (10,20] 462.6347
3 (20,30] 507.5614
4 (30,40] 482.6004
5 (40,50] 510.3081
6 (50,60] 480.7927
7 (60,70] 507.8944
8 (70,80] 458.4668
9 (80,90] 501.9672
10 (90,100] 493.4844
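The default labels such as (0,10] denote right-closed intervals. If labels closer to the "1-10, 11-20" wording of the question are wanted, cut() accepts a labels argument. The following is a minimal sketch in base R only; the seed, the label format, and the column names xrange and mean_y are illustrative choices rather than part of the answer above:

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(x = runif(50, 1, 100), y = runif(50, 300, 700))

breaks <- seq(0, 100, 10)
# Build labels "1-10", "11-20", ..., "91-100". This is naming only:
# the bins are still the right-closed intervals (0,10], (10,20], ...,
# so a value such as 10.5 lands in the bin labelled "11-20".
labels <- paste(head(breaks, -1) + 1, tail(breaks, -1), sep = "-")

df$xrange <- cut(df$x, breaks = breaks, labels = labels)

# Mean of y per bin; aggregate() drops bins that contain no observations
result <- aggregate(y ~ xrange, data = df, FUN = mean)
names(result)[2] <- "mean_y"
print(result)
```

Because x is continuous here, the integer-style labels are only a reading aid; the interval notation is the more precise description of each bin.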
Answer 1 (score: 4)
Use cut to form the groups and tapply to summarize them.
df$grp <- cut(df$x, seq(0, 100, 10))
with(df, tapply(y, grp, mean))
If you are a plyr fan, you might prefer
library(plyr)
ddply(df, .(grp), summarise, m = mean(y))
For completeness, the aggregate version is
aggregate(y ~ grp, df, mean)
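With only 50 points, some bins may hold very few observations, so it can help to report the count next to each mean. The sketch below, under the assumption of an arbitrary seed and illustrative names (summary_df, mean_y, n), lines up the tapply and aggregate results and adds the bin sizes:

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(x = runif(50, 1, 100), y = runif(50, 300, 700))
df$grp <- cut(df$x, seq(0, 100, 10))

means_tapply <- with(df, tapply(y, grp, mean))  # one entry per factor level
means_agg    <- aggregate(y ~ grp, df, mean)    # data frame, empty bins dropped

# Number of observations behind each mean
counts <- table(df$grp)

# Combine mean and count, matching the counts to the bins by label
summary_df <- data.frame(
  grp    = means_agg$grp,
  mean_y = means_agg$y,
  n      = as.vector(counts[as.character(means_agg$grp)])
)
print(summary_df)
```

The main practical difference between the two calls is the return type: tapply gives a named array with one slot per level, while aggregate gives a data frame containing only the bins that have data.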
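Answer 2 (score: 3)
One approach is to use cut() to create a factor from the x variable, specifying breaks every ten units. Given that factor, you can use by() or aggregate() or ... to summarize the data frame, or more precisely its y column: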
R> set.seed(42); DF <- data.frame(x=runif(50,1,100), y=rnorm(50,30,70))
R> summary(DF)
x y
Min. : 1.39 Min. :-179.5
1st Qu.:40.66 1st Qu.: -19.4
Median :64.45 Median : 39.6
Mean :60.29 Mean : 25.9
3rd Qu.:90.10 3rd Qu.: 74.7
Max. :98.90 Max. : 140.3
R> DF$cx <- cut(DF$x, breaks=seq(0,100,by=10))
R> ?by
R> by(DF, DF$cx, FUN=function(z) mean(z$y))
DF$cx: (0,10]
[1] 67.8747
---------------------------------------------
DF$cx: (10,20]
[1] 52.9104
---------------------------------------------
DF$cx: (20,30]
[1] -53.8961
---------------------------------------------
DF$cx: (30,40]
[1] 44.1992
---------------------------------------------
DF$cx: (40,50]
[1] 21.7404
---------------------------------------------
DF$cx: (50,60]
[1] 16.2122
---------------------------------------------
DF$cx: (60,70]
[1] -27.0338
---------------------------------------------
DF$cx: (70,80]
[1] 42.283
---------------------------------------------
DF$cx: (80,90]
[1] 40.8042
---------------------------------------------
DF$cx: (90,100]
[1] 38.8917
R>
Or use ddply():
R> library(plyr)
R> ddply(DF, .(cx), function(z) mean(z$y))
cx V1
1 (0,10] 67.8747
2 (10,20] 52.9104
3 (20,30] -53.8961
4 (30,40] 44.1992
5 (40,50] 21.7404
6 (50,60] 16.2122
7 (60,70] -27.0338
8 (70,80] 42.2830
9 (80,90] 40.8042
10 (90,100] 38.8917
R>
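The by() output is verbose for a single number per group. As one more base R option among the alternatives hinted at above, split() followed by sapply() returns the same means as a compact named vector. This is a sketch that reuses the DF and cx names from the session above; the variable name means is an illustrative choice:

```r
set.seed(42)
DF <- data.frame(x = runif(50, 1, 100), y = rnorm(50, 30, 70))
DF$cx <- cut(DF$x, breaks = seq(0, 100, by = 10))

# split() breaks y into one numeric vector per level of cx;
# sapply() then applies mean() to each piece and returns a named vector.
# A level with no observations would yield NaN.
means <- sapply(split(DF$y, DF$cx), mean)
print(means)
```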
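Answer 3 (score: 3)
I think the way you have framed the question leads to an overly narrow answer. You should consider regression methods for summarizing the joint relationship between continuous variables. Drawing a scatterplot with a fitted regression spline will do less violence to the underlying relationship than the piecewise analysis you describe.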
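A minimal sketch of that suggestion, assuming a natural cubic spline fitted with lm() and ns() from the splines package that ships with R. The seed, the choice of df = 4, and the names fit and grid are illustrative assumptions, not something the answer specifies:

```r
library(splines)  # part of the standard R distribution

set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(x = runif(50, 1, 100), y = runif(50, 300, 700))

# Natural cubic spline in x; df = 4 is an arbitrary illustrative choice
fit <- lm(y ~ ns(x, df = 4), data = df)

# Predict on an evenly spaced grid to draw a smooth fitted curve
grid <- data.frame(x = seq(min(df$x), max(df$x), length.out = 200))
grid$fit <- predict(fit, newdata = grid)

plot(df$x, df$y, xlab = "x", ylab = "y", main = "Scatterplot with spline fit")
lines(grid$x, grid$fit, lwd = 2)
```

Unlike binned means, the fitted curve does not jump at arbitrary breakpoints. For the uniform random data in the question it should come out roughly flat, since y does not depend on x there.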
Answer 4 (score: 3)
Here is a data.table solution
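library(data.table)
# Group by an expression passed as a list rather than as a quoted string,
# which current data.table versions may split at the commas
data.table(df)[, list(mean_y = mean(y)), by = list(xrange = cut(x, seq(0, 100, 10)))]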
Answer 5 (score: 2)
You can use tapply together with pretty, which generates the breakpoints for cut:
tapply(df$y,cut(df$x,pretty(range(df$x),high.u.bias=0.1)),mean)
(0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
496.9840 510.4164 502.4092 492.5806 493.3364 549.5207 507.4511 472.3391
(80,90] (90,100]
479.8795 482.6728
aggregate can also be used:
aggregate(df$y,list(cut(df$x,pretty(range(df$x),high.u.bias=0.1))),FUN=mean)
Group.1 x
1 (0,10] 496.9840
2 (10,20] 510.4164
3 (20,30] 502.4092
4 (30,40] 492.5806
5 (40,50] 493.3364
6 (50,60] 549.5207
7 (60,70] 507.4511
8 (70,80] 472.3391
9 (80,90] 479.8795
10 (90,100] 482.6728
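One caveat when the breakpoints come from the data: cut() builds right-closed intervals by default, so an observation exactly equal to the lowest break falls outside every bin and becomes NA. The sketch below passes include.lowest = TRUE to guard against that; the seed and the names brks and bins are illustrative assumptions:

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(x = runif(50, 1, 100), y = runif(50, 300, 700))

# pretty() picks round breakpoints that cover the observed range of x
brks <- pretty(range(df$x), high.u.bias = 0.1)
print(brks)

# include.lowest = TRUE closes the first interval on the left as well,
# so a value equal to the smallest break is still assigned to a bin
bins <- cut(df$x, brks, include.lowest = TRUE)
print(tapply(df$y, bins, mean))

# The edge case in isolation: 0 is excluded from (0,10] but included in [0,10]
print(cut(0, c(0, 10)))
print(cut(0, c(0, 10), include.lowest = TRUE))
```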