I want to compute the mean of a variable between two dates. Below is a reproducible data frame.
year <- c(1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,
1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,
1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,
1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997)
month <- c("JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC")
station <- c("A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B")
concentration <- as.numeric(round(runif(48,20,40),1))
df <- data.frame(year,month,station,concentration)
id <- c(1,2,3,4)
station1996 <- c("A","A","B","B")
station1997 <- c("B","A","A","B")
start <- c("06/01/1996","07/01/1996","07/01/1996","08/01/1996")
end <- c("04/01/1997","04/01/1997","04/01/1997","05/01/1997")
participant <- data.frame(id,station1996,station1997,start,end)
participant$start <- as.Date(participant$start, format = "%m/%d/%Y")
participant$end <- as.Date(participant$end, format = "%m/%d/%Y")
So I have the following two data sets:
df
   year month station concentration
1  1996   JAN       A          24.4
2  1996   FEB       A          37.0
3  1996   MAR       A          39.5
4  1996   APR       A          28.0
...
45 1997   SEP       B          37.7
46 1997   OCT       B          35.2
47 1997   NOV       B          26.8
48 1997   DEC       B          40.0
participant
  id station1996 station1997      start        end
1  1           A           B 1996-06-01 1997-04-01
2  2           A           A 1996-07-01 1997-04-01
3  3           B           A 1996-07-01 1997-04-01
4  4           B           B 1996-08-01 1997-05-01
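For each id, I want to compute the mean concentration over the months between the start date and the end date. Note that the station can change from one year to the next.
For example, for id = 1, I want the mean concentration between JUN 1996 and APR 1997. This should be based on the concentrations at station A from JUN 1996 to DEC 1996 and at station B from JAN 1997 to APR 1997.
Can anyone help?
Many thanks.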
Answer 0 (score: 1)
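Here is a data.table solution. The basic idea is to enumerate, for each id, every month in the start-to-end range as a yearmon, and then use that as an index into the concentration table df. This is a bit convoluted, so hopefully someone will come along and show you a simpler way.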
library(data.table)
library(zoo) # for as.yearmon(...)
setDT(df) # convert to data.table
setDT(participant)
df[, yrmon:= as.yearmon(paste(year,month,sep="-"), format="%Y-%B")] # add year-month column
p.melt <- reshape(participant, varying=2:3, direction="long", sep="", timevar="year")
x <- participant[, .(date=seq(start,end,by="month")), by=id]
x[, c("year","yrmon"):=.(year(date),as.yearmon(date))] # add year and year-month
x[p.melt, station:=station, on=c("id","year")] # add station
x[df, conc:= concentration, on=c("yrmon","station"), nomatch=0] # add concentration
setorder(x,id) # not necessary, but makes it easier to interpret x
result <- x[, .(mean.conc=mean(conc)), by=id] # mean(conc) by id
result   # values differ from run to run because runif() above is not seeded
# id mean.conc
# 1: 1 28.61818
# 2: 2 28.56000
# 3: 3 28.44000
# 4: 4 29.60000
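So, first we convert everything to data.tables. Then we add a yrmon column to df for indexing later. Then we create p.melt by reshaping participant into long format, so that the station is in one column and the indicator (1996 or 1997) is in a separate column. Then we create a temporary table x holding the sequence of monthly dates for each id, and add the year and the year-month for each date. Then we merge that with p.melt on id and year to add the station column to x. Then we merge x with df on yrmon and station to pick up the matching concentration. Finally, we simply aggregate conc in x by id using mean(...).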
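As a sanity check, the value for id 1 can be recomputed by hand in base R, following the rule stated in the question (station A for JUN to DEC 1996, station B for JAN to APR 1997). This is only a sketch and not part of the original answer: the `set.seed(1)` call and the names `in1996`, `in1997` and `keep` are my own assumptions, and the data frame is rebuilt with `rep()` to give the same layout as the question's `df`.

```r
# Hypothetical check: recompute the mean for id 1 without data.table.
set.seed(1)  # assumed seed so the toy data are repeatable (the question uses none)

# Same layout as the question's df: 1996 A, 1996 B, 1997 A, 1997 B, 12 months each
df <- data.frame(
  year          = rep(c(1996, 1997), each = 24),
  month         = rep(toupper(month.abb), times = 4),
  station       = rep(rep(c("A", "B"), each = 12), times = 2),
  concentration = round(runif(48, 20, 40), 1)
)

# id 1: station A for JUN-DEC 1996, then station B for JAN-APR 1997
in1996 <- df$year == 1996 & df$station == "A" &
  df$month %in% toupper(month.abb[6:12])
in1997 <- df$year == 1997 & df$station == "B" &
  df$month %in% toupper(month.abb[1:4])
keep <- in1996 | in1997

sum(keep)                      # number of monthly values used: 7 + 4
mean(df$concentration[keep])   # compare with mean.conc for id 1
```

If the data.table code above is run on this same `df`, its `mean.conc` for id 1 should agree with the last line. The numbers will not match the output shown in the answer, because that output came from a different random draw.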