Rolling correlation by id and date

Time: 2015-01-24 22:27:46

Tags: r date time-series correlation

I have some data with a name, a date, and two variables (x, y). I want to compute a rolling correlation between x and y:

  dt <- seq(as.Date("2013/1/1"), by = "days", length.out = 20)
  df1 <- data.frame("ABC", dt, rnorm(20, 0, 3), rnorm(20, 2, 4))
  names(df1) <- c("name", "date", "x", "y")
  df2 <- data.frame("XYZ", dt, rnorm(20, 2, 5), rnorm(20, 3, 10))
  names(df2) <- c("name", "date", "x", "y")
  df <- rbind(df1, df2)

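I want to add a column named "Correl" that, for each date, holds the correlation of x and y over the previous 5 periods. However, when the name changes, I want it to give NA instead.

As shown below, when the data switches from ABC to XYZ, the correlation is NA for the first 4 periods. Once there are 5 data points, the correlation starts again.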

  name  date    x   y   Correl
  ABC   1/1/2013    -3.59   -5.13   NA
  ABC   1/2/2013    -8.69   4.22    NA
  ABC   1/3/2013    2.80    -0.59   NA
  ABC   1/4/2013    0.54    5.06    NA
  ABC   1/5/2013    1.13    3.49    -0.03
  ABC   1/6/2013    0.52    5.16    -0.38
  ABC   1/7/2013    -0.24   -5.40   0.08
  ABC   1/8/2013    3.26    -2.75   -0.16
  ABC   1/9/2013    1.33    5.94    -0.04
  ABC   1/10/2013   2.24    1.14    -0.01
  ABC   1/11/2013   0.01    9.87    -0.24
  ABC   1/12/2013   2.29    1.28    -0.99
  ABC   1/13/2013   1.03    -6.30   -0.41
  ABC   1/14/2013   0.62    4.82    -0.47
  ABC   1/15/2013   1.08    -1.17   -0.50
  ABC   1/16/2013   2.43    8.86    0.45
  ABC   1/17/2013   -3.43   9.38    -0.35
  ABC   1/18/2013   -5.73   7.59    -0.38
  ABC   1/19/2013   1.77    3.13    -0.44
  ABC   1/20/2013   -0.97   -0.77   -0.24
  XYZ   1/1/2013    2.12    10.22   NA
  XYZ   1/2/2013    -0.81   0.22    NA
  XYZ   1/3/2013    -1.55   -2.25   NA
  XYZ   1/4/2013    -4.53   3.63    NA
  XYZ   1/5/2013    2.95    -1.51   0.13
  XYZ   1/6/2013    6.76    24.16   0.69
  XYZ   1/7/2013    3.33    7.31    0.66
  XYZ   1/8/2013    -1.47   -4.23   0.67
  XYZ   1/9/2013    3.89    -0.43   0.81
  XYZ   1/10/2013   5.63    17.95   0.86
  XYZ   1/11/2013   3.29    -7.09   0.63
  XYZ   1/12/2013   6.03    -9.03   0.29
  XYZ   1/13/2013   -5.63   6.96    -0.19
  XYZ   1/14/2013   1.70    13.59   -0.18
  XYZ   1/15/2013   -1.19   -16.79  -0.29
  XYZ   1/16/2013   4.76    4.91    -0.11
  XYZ   1/17/2013   9.02    25.16   0.57
  XYZ   1/18/2013   4.56    6.48    0.84
  XYZ   1/19/2013   5.30    11.81   0.99
  XYZ   1/20/2013   -0.60   3.38    0.84

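Update: I have tried all of the suggestions and ran into problems with my actual data. I have attached a subset of the data below:

https://www.dropbox.com/s/6k4xhwuinlu0p1f/TEST_SUBSET.csv?dl=0

I cannot get this to work. I tried removing the NAs, renaming the rows, reading the data in differently, and formatting the dates differently. Nothing has worked for me. Could you check whether what you are running works on this data set? Thanks a lot, everyone!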

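3 answers:

Answer 0: (score: 2)

Apply ave to the row indexes of df so that each name is processed separately, and use rollapplyr to perform the rolling calculation. Note that i is a vector of row indexes: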

library(zoo)

corx <- function(x) cor(x[, 1], x[, 2])
df$Correl <- ave(1:nrow(df), df$name, FUN = function(i) 
      rollapplyr(df[i, c("x", "y")], 5, corx, by.column = FALSE, fill = NA))

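Update: changed rollapply to rollapplyr so that the result is consistent with the output shown in the question. If you want centered correlations, change it back to rollapply.

Answer 1: (score: 1)

Here is a solution using base R. Note that it requires the data set to be sorted by name and then by date, in that order.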
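
To make the difference between the two alignments concrete, here is a minimal sketch on a made-up 7-row series (the values of x and y are arbitrary and not from the question). It assumes the zoo package is installed. With a width of 5, the right-aligned version leaves NA only in the leading rows, while the centered default leaves NA at both ends.

```r
library(zoo)

# Toy series, chosen only for illustration
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(2, 1, 4, 3, 6, 5, 8)
xy <- cbind(x, y)

corx <- function(m) cor(m[, 1], m[, 2])

# Right-aligned: the value at row i uses rows (i - 4):i
right <- rollapplyr(xy, 5, corx, by.column = FALSE, fill = NA)

# Centered (the rollapply default): the value at row i uses rows (i - 2):(i + 2)
centered <- rollapply(xy, 5, corx, by.column = FALSE, fill = NA)

which(is.na(right))     # NA only in the leading rows
which(is.na(centered))  # NA at both ends
```

Both calls evaluate the same windows; only the row that each result is attached to differs.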

dt<-seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1<-data.frame("ABC",dt,rnorm(20, 0,3),rnorm(20, 2,4) )
names(df1)<-c("name","date","x","y")
df2<-data.frame("XYZ",dt,rnorm(20, 2,5),rnorm(20, 3,10) )
names(df2)<-c("name","date","x","y")
df<-rbind(df1,df2)

rollcorr = function(df, lag = 4) {
  out = numeric(nrow(df) - lag)
  for( i in seq_along(out) ) {
    window = i:(i+lag)
    out[i] = cor(df$x[window], df$y[window])
  }
  out <- c(rep(NA, lag), out)
  return(out)
}

df$Correl <- do.call(c, by(df[, -1], df[, 1], rollcorr))
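
The sorting requirement matters because by() returns its groups in the order of the sorted names, so the concatenated result only lines up with the rows of df when df is already in that order. A minimal sketch of the sort, on a hypothetical unsorted data frame called toy (the name and values are illustrative only):

```r
# Hypothetical unsorted input, for illustration only
set.seed(1)
toy <- data.frame(
  name = rep(c("XYZ", "ABC"), each = 3),
  date = rep(as.Date("2013-01-03") - 0:2, times = 2),
  x = rnorm(6),
  y = rnorm(6)
)

# Sort by name first, then by date within each name
toy <- toy[order(toy$name, toy$date), ]
rownames(toy) <- NULL
```

Running the same order() call on df before calling rollcorr should keep the Correl values aligned with their rows.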

Answer 2: (score: 1)

This is a bit late to the party, but below is a fairly compact solution using dplyr and rollapply (from the zoo package).

library(dplyr)
library(zoo)

dt <- seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df1 <- data.frame("ABC", dt, rnorm(20, 0, 3), rnorm(20, 2, 4))
names(df1) <- c("name", "date", "x", "y")
df2 <- data.frame("XYZ", dt, rnorm(20, 2, 5), rnorm(20, 3, 10))
names(df2) <- c("name", "date", "x", "y")
df <- rbind(df1, df2)


df<-df %>%
  group_by(name)%>%
  arrange(date) %>%
  do({
    correl <- rollapply(.[-(1:2)],width = 5, function(a) cor(a[,1],a[,2]), by.column = FALSE, align = "right", fill = NA)
    data.frame(., correl)
  })

which returns...

> df
Source: local data frame [40 x 5]
Groups: name

   name       date           x          y      correl
1   ABC 2013-01-01 -0.61707785 -0.7299461          NA
2   ABC 2013-01-02  1.35353618  9.1314743          NA
3   ABC 2013-01-03  2.60815932  0.2511828          NA
4   ABC 2013-01-04 -2.89619789 -1.2586655          NA
5   ABC 2013-01-05  2.23750886  4.6616034  0.52013407
6   ABC 2013-01-06 -1.97573999  3.6800832  0.37575664
7   ABC 2013-01-07  1.70360813  2.2621718  0.32390612
8   ABC 2013-01-08  0.02017797  2.5088032  0.64020507
9   ABC 2013-01-09  0.96263256  1.6711756 -0.00557611
10  ABC 2013-01-10 -0.62400803  5.2011656 -0.66040650
..  ...        ...         ...        ...         ...

Checking that the other group behaves correctly...

> df %>%
+   filter(name=="XYZ")
Source: local data frame [20 x 5]
Groups: name

   name       date          x          y     correl
1   XYZ 2013-01-01  3.4199729  5.0866361         NA
2   XYZ 2013-01-02  4.7326297 -5.4613465         NA
3   XYZ 2013-01-03  3.8983329 11.1635903         NA
4   XYZ 2013-01-04  1.5235936  3.9077184         NA
5   XYZ 2013-01-05 -5.4885373  7.8961020 -0.3755766
6   XYZ 2013-01-06  0.2311371  2.0157046 -0.3754510
7   XYZ 2013-01-07  2.6903306 -3.2940181 -0.1808097
8   XYZ 2013-01-08 -0.2584268  3.6047800 -0.8457930
9   XYZ 2013-01-09 -0.2897795  2.1029431 -0.9526992
10  XYZ 2013-01-10  5.9571558 18.5810947  0.7025559
11  XYZ 2013-01-11 -7.5250647 -8.0858699  0.7949917
12  XYZ 2013-01-12  2.8438336 -8.4072829  0.6563161
13  XYZ 2013-01-13  7.2295030 -0.1236801  0.5383666
14  XYZ 2013-01-14 -0.7579570 -0.2830291  0.5542751
15  XYZ 2013-01-15  4.3116507 -6.5291051  0.3894343
16  XYZ 2013-01-16  1.4334510  0.5957465 -0.1480032
17  XYZ 2013-01-17 -2.6444881  6.1261976 -0.6183805
18  XYZ 2013-01-18  0.8517223  0.5587499 -0.9243050
19  XYZ 2013-01-19  6.2140131 -3.0944259 -0.8939475
20  XYZ 2013-01-20 11.2871086 -0.1187153 -0.6845300

Hope this helps!
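
Newer dplyr releases mark do() as superseded, so the same idea could also be written with mutate(). The following is only a sketch of that variant, not the code that produced the output above. It assumes dplyr and zoo are installed, and the seed is arbitrary. Sorting by name and date up front avoids relying on how arrange() treats groups.

```r
library(dplyr)
library(zoo)

set.seed(42)  # arbitrary seed so the sketch is reproducible
dt <- seq(as.Date("2013/1/1"), by = "days", length.out = 20)
df <- rbind(
  data.frame(name = "ABC", date = dt, x = rnorm(20, 0, 3), y = rnorm(20, 2, 4)),
  data.frame(name = "XYZ", date = dt, x = rnorm(20, 2, 5), y = rnorm(20, 3, 10))
)

df <- df %>%
  arrange(name, date) %>%   # sort explicitly by name, then date
  group_by(name) %>%
  mutate(correl = rollapplyr(cbind(x, y), 5,
                             function(a) cor(a[, 1], a[, 2]),
                             by.column = FALSE, fill = NA)) %>%
  ungroup()
```

As in the do() version, the first four rows of each name should come out as NA, because the right-aligned window of width 5 is not yet full there.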

Follow-up

I just ran the following on your actual data set:

library(dplyr)
library(zoo)
import <- read.csv("TEST_SUBSET.CSV", header=TRUE, stringsAsFactors = FALSE)
str(head(import))

import_df<-import %>%
  group_by(id)%>%
  arrange(asof_dt) %>%
  do({
    correl <- rollapply(.[-(1:2)],width = 5, function(a) cor(a[,1],a[,2]), by.column = FALSE, align = "right", fill = NA)
    data.frame(., correl)
  })
import_df

and got the following:

> import_df
Source: local data frame [15,365 x 5]
Groups: id

       id   asof_dt            x            y     correl
1  DC1123 1/10/1990 -0.003773632           NA         NA
2  DC1123 1/10/1991  0.014034992           NA         NA
3  DC1123 1/10/1992 -0.004109765           NA         NA
4  DC1123 1/10/1994  0.006369326  0.012176085         NA
5  DC1123 1/10/1995  0.014900600  0.001241080         NA
6  DC1123 1/10/1996  0.005763689 -0.013112491         NA
7  DC1123 1/10/1997  0.006949765  0.010737034         NA
8  DC1123 1/10/2000  0.044052805  0.003346296 0.02724175
9  DC1123 1/10/2001  0.009452785  0.017582638 0.01362101
10 DC1123 1/10/2002 -0.018876970  0.004346372 0.01343657
..    ...       ...          ...          ...        ...

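So it looks like it is working. By default, the cor function returns NA if any value in the window is missing, so it only returns a number once all 5 points in the window are present. Because y is NA in rows 1 to 3, that does not happen until row 8.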
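
One thing to watch in the output above: with stringsAsFactors = FALSE, asof_dt is read as text, so arrange(asof_dt) sorts it as text rather than in calendar order (the rows shown are all "1/10" of different years). A minimal sketch of converting the column to Date before sorting follows. The toy rows and the month/day/year format are assumptions, since the real file layout is not shown here.

```r
# Toy stand-in for the imported data; the m/d/Y format is an assumption
import <- data.frame(
  id = "DC1123",
  asof_dt = c("1/10/1991", "2/1/1990", "1/10/1990", "12/3/1990"),
  stringsAsFactors = FALSE
)

# Character sort: compares text, not calendar order
sort(import$asof_dt)

# Convert to Date, then order by id and by date within id
import$asof_dt <- as.Date(import$asof_dt, format = "%m/%d/%Y")
import <- import[order(import$id, import$asof_dt), ]
rownames(import) <- NULL
```

If the real file uses day/month/year instead, the format string would need to be "%d/%m/%Y".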