R - 按组间滚动12个月的中位数,时间间隔不一致

时间:2015-07-16 16:19:44

标签: r group-by moving-average

我有很多小组多年来收集的数据(值)。我想计算每组的12个月滚动中位数(使用前12个月)。我查看了动物园(和其他)滚动功能,它们似乎都是固定的间隔(例如12个月= 12行),但我的数据中的日期间隔对于每个组都不一致。每月或每隔一个月收集一次数据值,但也存在其他差距。我想我需要一个滚动中值函数,它收集每组的前364天的值。我很感激一些帮助!

以下是我的数据示例:

Date    Group   Value
8/17/2013   A   5
10/2/2013   A   13
1/15/2014   A   11
3/15/2014   A   2
5/22/2014   A   7
7/15/2014   A   1
9/3/2014    A   1
11/15/2014  A   7
7/22/2013   B   13
8/5/2013    B   13
9/7/2013    B   12
10/16/2013  B   6
11/17/2013  B   5
12/9/2013   B   15
1/30/2014   B   1
2/23/2014   B   10
3/24/2014   B   15
4/5/2014    B   3
5/26/2014   B   3
6/16/2014   B   4
8/5/2014    B   6
9/26/2014   B   8
10/16/2014  B   15
11/29/2014  B   12
12/13/2016  B   1

我想在此表“Rolling Median”中添加一列,其中包含每组的滚动12个月(或365天)中位数。

2 个答案:

答案 0 :(得分:2)

此软件包可能对您有所帮助:

https://github.com/mgahan/boRingTrees

它被称为boRingTrees,它处理这类问题。如果您不想下载该软件包,也可以使用以下代码:

##Utilize the data.table package
library(data.table)
setDT(data)
data[, Date2 := as.Date(Date,format="%m/%d/%Y")] #Format date field

#Apply rollingByCalcs function (full function code is below)
data[, Roll_Median := rollingByCalcs(data,bylist=c("Group"),dates="Date2",target="Value",
                      lower=0,upper=365,incbounds=T,stat=median,na.rm=T,cores=1)]


rollingByCalcs <- function(data,bylist=NULL,dates,target=NULL,
                           lower,upper,incbounds=T,stat=length,na.rm=T,cores=1){
  tic <- Sys.time()

  require("data.table")
  require("parallel")
  data <- data.table(data)


  if (is.null(bylist)){
    data[, id.filler := 1]
    bylist <- "id.filler"
  }

  if (is.null(target)){
    data[,target:=1]
    target <- "target"
  }

  ##Create group by variable
  data[,Grp.Var:=.GRP,by=bylist]

  ##Assign variable names
  data[,target:=data[,eval(parse(text=target))]]
  data[,dates:=data[,eval(parse(text=dates))]]

  ##Create "list" of comparison dates
  Ref <- data[,list(Compare_Value=list(I(target)),Compare_Date=list(I(dates))), by=c("Grp.Var")]

  ##Compare two lists and see of the compare date is within N days
  data$Roll.Val <- mcmapply(FUN = function(RD, NUM) {
    d <- as.numeric(RD-Ref$Compare_Date[[NUM]])
    true.vals <- between(x=d,lower=lower,upper=upper,incbounds=incbounds)  
    out <- stat(Ref$Compare_Value[[NUM]][true.vals])
    return(out)
  }, RD = data$dates,NUM=data$Grp.Var,mc.cores=cores)

  print(Sys.time()-tic)
  return(data$Roll.Val)
}

答案 1 :(得分:0)

你可以写一个辅助函数。这是使用dplyr包的人:

library(dplyr)
rollingMedian <- function(targetDate, targetGroup) {
  dat %>%
    mutate(thisDiff = difftime(as.Date(Date), targetDate, unit = "days")) %>%
    filter(thisDiff < 0, thisDiff > -366, Group == targetGroup) %>%
    summarise(medValue = median(Value))
}

dat$rollingMed <- mapply(rollingMedian, dat$Date, dat$Group)

<强>结果:

dat
         Date Group Value rollingMed
1  2013-08-17     A     5         NA
2  2013-10-02     A    13          5
3  2014-01-15     A    11          9
4  2014-03-15     A     2         11
5  2014-05-22     A     7          8
6  2014-07-15     A     1          7
7  2014-09-03     A     1          7
8  2014-11-15     A     7          2
9  2013-07-22     B    13         NA
10 2013-08-05     B    13         13
...

使用的数据:

dat <- structure(list(Date = structure(c(1376697600, 1380672000, 1389744000, 
1394841600, 1400716800, 1405382400, 1409702400, 1416009600, 1374451200, 
1375660800, 1378512000, 1381881600, 1384646400, 1386547200, 1391040000, 
1393113600, 1395619200, 1396656000, 1401062400, 1402876800, 1407196800, 
1411689600, 1413417600, 1417219200, 1481587200), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), Group = c("A", "A", "A", "A", "A", "A", "A", "A", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B"), Value = c(5L, 13L, 11L, 2L, 7L, 1L, 1L, 
7L, 13L, 13L, 12L, 6L, 5L, 15L, 1L, 10L, 15L, 3L, 3L, 4L, 6L, 
8L, 15L, 12L, 1L)), .Names = c("Date", "Group", "Value"), row.names = c(NA, 
-25L), class = "data.frame")