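I have data (values) collected over many years for many groups. I want to compute a 12-month rolling median for each group (using the preceding 12 months). I have looked at the rolling functions in zoo (and other packages), but they all seem to use fixed-width windows (e.g. 12 months = 12 rows), and the date intervals in my data are not consistent within each group. Values are collected every month or every other month, but there are other gaps as well. I think I need a rolling median function that collects the values from the previous 364 days for each group. I would appreciate some help!
Here is a sample of my data: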
Date Group Value
8/17/2013 A 5
10/2/2013 A 13
1/15/2014 A 11
3/15/2014 A 2
5/22/2014 A 7
7/15/2014 A 1
9/3/2014 A 1
11/15/2014 A 7
7/22/2013 B 13
8/5/2013 B 13
9/7/2013 B 12
10/16/2013 B 6
11/17/2013 B 5
12/9/2013 B 15
1/30/2014 B 1
2/23/2014 B 10
3/24/2014 B 15
4/5/2014 B 3
5/26/2014 B 3
6/16/2014 B 4
8/5/2014 B 6
9/26/2014 B 8
10/16/2014 B 15
11/29/2014 B 12
12/13/2016 B 1
I would like to add a column "Rolling Median" to this table containing the rolling 12-month (or 365-day) median for each group.
Answer 0: (score: 2)
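This package may help you:
https://github.com/mgahan/boRingTrees
It is called boRingTrees, and it handles this kind of problem. If you would rather not install the package, you can also use the following code: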
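## Use the data.table package
library(data.table)
setDT(data)
data[, Date2 := as.Date(Date, format="%m/%d/%Y")] # Parse the date field
# Apply the rollingByCalcs function (full function code is below; define it before running this call).
# With lower=0 and incbounds=TRUE the window includes the current row itself;
# use lower=1 to include only observations from earlier days.
data[, Roll_Median := rollingByCalcs(data, bylist=c("Group"), dates="Date2", target="Value",
                                     lower=0, upper=365, incbounds=TRUE, stat=median, na.rm=TRUE, cores=1)]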
rollingByCalcs <- function(data, bylist=NULL, dates, target=NULL,
                           lower, upper, incbounds=TRUE, stat=length, na.rm=TRUE, cores=1){
  tic <- Sys.time()
  require("data.table")
  require("parallel")
  data <- data.table(data)
  ## Use a single dummy group when no grouping columns are given
  if (is.null(bylist)){
    data[, id.filler := 1]
    bylist <- "id.filler"
  }
  ## Use a constant target (e.g. for counting rows) when none is given
  if (is.null(target)){
    data[, target := 1]
    target <- "target"
  }
  ## Create group-by variable
  data[, Grp.Var := .GRP, by=bylist]
  ## Copy the target and date columns to fixed names
  data[, target := data[, eval(parse(text=target))]]
  data[, dates := data[, eval(parse(text=dates))]]
  ## Create per-group lists of comparison values and dates
  Ref <- data[, list(Compare_Value=list(I(target)), Compare_Date=list(I(dates))), by=c("Grp.Var")]
  ## For each row, check which of the group's dates fall within the window and summarise their values
  data$Roll.Val <- mcmapply(FUN = function(RD, NUM) {
    d <- as.numeric(RD - Ref$Compare_Date[[NUM]])
    true.vals <- between(x=d, lower=lower, upper=upper, incbounds=incbounds)
    vals <- Ref$Compare_Value[[NUM]][true.vals]
    if (na.rm) vals <- vals[!is.na(vals)] # drop missing values when na.rm is TRUE
    out <- stat(vals)
    return(out)
  }, RD = data$dates, NUM=data$Grp.Var, mc.cores=cores)
  print(Sys.time() - tic)
  return(data$Roll.Val)
}
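To make the window logic concrete, here is a minimal base R sketch of the per-row calculation, applied to group A from the question. The helper name `window_median` is hypothetical and is not part of boRingTrees. Unlike the call above (which uses `lower=0` and therefore also counts the current row), this sketch assumes `lower = 1`, so only observations from the 1 to 365 days before each date are used.

```r
# Group A from the question
a <- data.frame(
  Date  = as.Date(c("8/17/2013", "10/2/2013", "1/15/2014", "3/15/2014",
                    "5/22/2014", "7/15/2014", "9/3/2014", "11/15/2014"),
                  format = "%m/%d/%Y"),
  Value = c(5, 13, 11, 2, 7, 1, 1, 7)
)

# Hypothetical helper: median of the values observed between `lower` and
# `upper` days (inclusive) before each date
window_median <- function(dates, values, lower = 1, upper = 365) {
  vapply(seq_along(dates), function(i) {
    d <- as.numeric(dates[i] - dates)   # days from each observation to row i
    in_window <- d >= lower & d <= upper
    if (!any(in_window)) return(NA_real_)
    median(values[in_window])
  }, numeric(1))
}

a$Roll_Median <- window_median(a$Date, a$Value)
a$Roll_Median
# NA 5 9 11 8 7 7 2
```

For multiple groups, the same helper could be applied to each group separately, which is what the `bylist` argument does in the function above.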
Answer 1: (score: 0)
You can write a helper function. Here is one that uses the dplyr package:
library(dplyr)
# Median of Value over the 365 days before targetDate, within targetGroup
rollingMedian <- function(targetDate, targetGroup) {
  dat %>%
    mutate(thisDiff = difftime(as.Date(Date), targetDate, units = "days")) %>%
    filter(thisDiff < 0, thisDiff > -366, Group == targetGroup) %>%
    summarise(medValue = median(Value))
}
# Apply the helper to every (Date, Group) pair
dat$rollingMed <- mapply(rollingMedian, dat$Date, dat$Group)
Result:
dat
Date Group Value rollingMed
1 2013-08-17 A 5 NA
2 2013-10-02 A 13 5
3 2014-01-15 A 11 9
4 2014-03-15 A 2 11
5 2014-05-22 A 7 8
6 2014-07-15 A 1 7
7 2014-09-03 A 1 7
8 2014-11-15 A 7 2
9 2013-07-22 B 13 NA
10 2013-08-05 B 13 13
...
Data used:
dat <- structure(list(Date = structure(c(1376697600, 1380672000, 1389744000,
1394841600, 1400716800, 1405382400, 1409702400, 1416009600, 1374451200,
1375660800, 1378512000, 1381881600, 1384646400, 1386547200, 1391040000,
1393113600, 1395619200, 1396656000, 1401062400, 1402876800, 1407196800,
1411689600, 1413417600, 1417219200, 1481587200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Group = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), Value = c(5L, 13L, 11L, 2L, 7L, 1L, 1L,
7L, 13L, 13L, 12L, 6L, 5L, 15L, 1L, 10L, 15L, 3L, 3L, 4L, 6L,
8L, 15L, 12L, 1L)), .Names = c("Date", "Group", "Value"), row.names = c(NA,
-25L), class = "data.frame")
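As one possible variation (a sketch, not the answerer's code), the same 365-day look-back can be computed per group with `group_by()` and a small vectorised helper, which yields a plain numeric column. The helper name `prev_year_median` and the small data frame `dat2` are hypothetical and used only for illustration; the window bounds mirror the filter above (more than 0 and fewer than 366 days earlier).

```r
library(dplyr)

# Small two-group example in the same layout as the question's data
dat2 <- data.frame(
  Date  = as.Date(c("2013-08-17", "2013-10-02", "2014-01-15",
                    "2013-07-22", "2013-08-05", "2013-09-07")),
  Group = c("A", "A", "A", "B", "B", "B"),
  Value = c(5, 13, 11, 13, 13, 12)
)

# Hypothetical helper: median of the values from the 1 to 365 days before each date
prev_year_median <- function(dates, values) {
  vapply(seq_along(dates), function(i) {
    d <- as.numeric(difftime(dates[i], dates, units = "days"))
    keep <- d > 0 & d < 366
    if (any(keep)) median(values[keep]) else NA_real_
  }, numeric(1))
}

res <- dat2 %>%
  group_by(Group) %>%
  mutate(rollingMed = prev_year_median(Date, Value)) %>%
  ungroup()

res$rollingMed
# NA 5 9 NA 13 13
```

Grouping first means each date is only compared against dates from its own group, so the helper does not need a group argument.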