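I am trying to average the first 5 data points of "var1" for each "year". My data is shown below. The number of data points differs from year to year. Thanks a lot for your help! :)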
year var1
2010 1
2010 2
2010 3
2010 4
2010 2
2010 4
2009 2
2009 3
2009 4
2009 1
2009 3
2009 4
2009 2
2009 5
2009 3
2009 6
2008 4
2008 2
2008 3
2008 4
2008 1
2008 3
Answer 0 (score: 2)
Something like this?
t <- read.csv("t.txt", sep="") ## Read data
myMean <- function(x) ifelse(length(x)<5, mean(x), mean(x[1:5]))
ans <- aggregate(var1 ~ year, data = t, FUN = myMean)
ans
year var1
1 2008 2.8
2 2009 2.6
3 2010 2.4
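We define a function myMean that computes the mean of the first 5 elements of a given vector. The ifelse is a safeguard: if a year has fewer than 5 data points, we take the mean of all of its data points instead.
We use aggregate to partition the dataset by year. For each year, we apply the myMean function to var1.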
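As a side note, the explicit length check can be avoided, because head(x, 5) returns at most five elements and returns the whole vector when it is shorter. The following is a minimal sketch under that observation; firstFiveMean and the two sample vectors are illustrative names and values, not part of the answer above.

```r
# Hypothetical helper: mean of at most the first n elements of x.
firstFiveMean <- function(x, n = 5) mean(head(x, n))

long_year  <- c(4, 2, 3, 4, 1, 3)  # 2008 from the question (6 points)
short_year <- c(3, 8)              # assumed year with fewer than 5 points

firstFiveMean(long_year)   # mean of 4, 2, 3, 4, 1
firstFiveMean(short_year)  # mean of all available points

# Check that it agrees with the ifelse-based myMean from the answer.
myMean <- function(x) ifelse(length(x) < 5, mean(x), mean(x[1:5]))
stopifnot(firstFiveMean(long_year) == myMean(long_year),
          firstFiveMean(short_year) == myMean(short_year))
```

It could be passed to aggregate in place of myMean, for example aggregate(var1 ~ year, data = t, FUN = firstFiveMean).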
Answer 1 (score: 2)
Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'year', take the first 5 values of 'var1' (using head), and compute the mean.
library(data.table)
setDT(df1)[, list(var1=mean(head(var1,5))), year]
# year var1
#1: 2010 2.4
#2: 2009 2.6
#3: 2008 2.8
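The answer does not show how df1 is created, so here is a self-contained sketch that assumes df1 holds the question's data. It uses as.data.table() rather than setDT() so that df1 itself is left unchanged. That is a choice made for this illustration, not something the answer requires.

```r
library(data.table)

# Assumption: df1 holds the question's data (the answer does not define it).
df1 <- data.frame(
  year = rep(c(2010, 2009, 2008), times = c(6, 10, 6)),
  var1 = c(1, 2, 3, 4, 2, 4,
           2, 3, 4, 1, 3, 4, 2, 5, 3, 6,
           4, 2, 3, 4, 1, 3)
)

# as.data.table() makes a copy, so df1 stays a plain data.frame;
# setDT(df1) would convert it in place instead.
dt <- as.data.table(df1)

# Group by year, keep at most the first 5 values of var1, then average.
res <- dt[, .(var1 = mean(head(var1, 5))), by = year]
res
```

Groups appear in order of first occurrence in the data (2010, 2009, 2008), which matches the output shown in the answer.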
Answer 2 (score: 1)
Here is another option, using split and sapply:
sapply( split(X$var1,X$year), function(x) ifelse(length(x)<5, mean(x), mean(x[1:5])) )
where X is the given data frame.
Speed comparison:
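To make the intermediate step visible, the sketch below shows what split returns before the means are taken. The toy data frame is an assumption in the same shape as the question's data, and vapply is used here as a stricter stand-in for sapply (it checks that each group yields exactly one number).

```r
# Assumed toy data frame in the same shape as the question's data.
X <- data.frame(year = c(2011, 2011, 2010, 2010, 2010, 2010, 2010, 2010),
                var1 = c(3, 8, 1, 2, 3, 4, 2, 4))

# split() returns a named list with one vector of var1 values per year.
groups <- split(X$var1, X$year)
str(groups)

# vapply() works like sapply() but enforces a numeric result of length 1.
means <- vapply(groups, function(x) mean(head(x, 5)), numeric(1))
means
```

The result is a named numeric vector whose names are the years, sorted in increasing order.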
library(microbenchmark)
library(data.table)
X <- read.table( header = TRUE,
text = "year var1
2011 3
2011 8
2010 1
2010 2
2010 3
2010 4
2010 2
2010 4
2009 2
2009 3
2009 4
2009 1
2009 3
2009 4
2009 2
2009 5
2009 3
2009 6
2008 4
2008 2
2008 3
2008 4
2008 1
2008 3" )
myMean <- function(x) ifelse(length(x)<5, mean(x), mean(x[1:5]))
microbenchmark(
akrun = setDT(X)[, list(var1=mean(head(var1,5))), year],
PoChoi.1 = aggregate(var1 ~ year, data = X, FUN = myMean),
PoChoi.2 = aggregate(var1 ~ year, data = X, FUN = function(x) ifelse(length(x)<5, mean(x), mean(x[1:5]))),
mra68.1 = sapply( split(X$var1,X$year), myMean ),
mra68.2 = sapply( split(X$var1,X$year), function(x) ifelse(length(x)<5, mean(x), mean(x[1:5])) ),
times = 1000
)
# Unit: microseconds
# expr min lq mean median uq max neval
# akrun 1781.673 3571.9520 3747.1294 3772.582 3931.320 147014.295 1000
# PoChoi.1 2273.966 4563.1035 4498.3682 4739.505 4982.933 9535.620 1000
# PoChoi.2 2289.817 4571.2555 4515.6098 4733.391 4956.892 21497.376 1000
# mra68.1 347.368 693.8295 711.6527 731.420 769.462 5615.848 1000
# mra68.2 346.915 694.7350 717.4941 730.740 772.633 5560.143 1000
The akrun result is a data.table, the PoChoi results are data frames, and the mra68 results are named vectors:
> akrun
year var1
1: 2011 5.5
2: 2010 2.4
3: 2009 2.6
4: 2008 2.8
> PoChoi.1
year var1
1 2008 2.8
2 2009 2.6
3 2010 2.4
4 2011 5.5
> PoChoi.2
year var1
1 2008 2.8
2 2009 2.6
3 2010 2.4
4 2011 5.5
> mra68.1
2008 2009 2010 2011
2.8 2.6 2.4 5.5
> mra68.2
2008 2009 2010 2011
2.8 2.6 2.4 5.5
A larger example:
library(microbenchmark)
library(data.table)
set.seed(1)
X <- data.frame( year = sample( 1500:2015, 10000, replace=TRUE ),
var1 = sample( 1:10, 10000, replace=TRUE ) )
myMean <- function(x) ifelse(length(x)<5, mean(x), mean(x[1:5]))
microbenchmark(
akrun = setDT(X)[, list(var1=mean(head(var1,5))), year],
PoChoi.1 = aggregate(var1 ~ year, data = X, FUN = myMean),
PoChoi.2 = aggregate(var1 ~ year, data = X, FUN = function(x) ifelse(length(x)<5, mean(x), mean(x[1:5]))),
mra68.1 = sapply( split(X$var1,X$year), myMean ),
mra68.2 = sapply( split(X$var1,X$year), function(x) ifelse(length(x)<5, mean(x), mean(x[1:5])) ),
times = 1000
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# akrun 15.44811 23.50436 36.81674 43.12405 44.22435 69.62202 1000
# PoChoi.1 33.96411 51.52858 83.29682 95.53486 99.60884 241.59967 1000
# PoChoi.2 33.64844 51.70747 83.47835 95.07223 99.44127 247.55881 1000
# mra68.1 11.05145 17.33191 27.21526 31.41954 32.34819 126.89461 1000
# mra68.2 11.05054 17.16615 26.96236 31.25061 32.14054 85.44422 1000
Apart from the different classes (data.table, data.frame, named vector), the results are identical:
> all( PoChoi.1 == PoChoi.2 )
[1] TRUE
> all( PoChoi.1$var1 == mra68.1 )
[1] TRUE
> all( mra68.1 == mra68.2 )
[1] TRUE
> all( akrun$var1[order(akrun$year)] == mra68.1 )
[1] TRUE
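The checks above compare floating-point means with ==, which works here because every method performs the same arithmetic. When results come from different computations, a tolerance-based comparison is usually safer. A small sketch with assumed stand-in vectors (a and b are not the benchmark results):

```r
# Assumed stand-ins for two result vectors computed by different methods.
a <- c(`2008` = 2.8, `2009` = 2.6)
b <- c(0.1 + 0.2 + 2.5, 2.6)   # 2.8 reached through a different sum

all(a == b)                      # exact comparison; can be affected by rounding
isTRUE(all.equal(unname(a), b))  # comparison within a small tolerance
```

unname() drops the year names so that all.equal compares values only.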