Creating a "yesterday's value" variable for multiple time series

Time: 2015-11-12 18:33:31

Tags: r performance time-series

I'm working on a project in R and I'm a bit stuck. I have four time series in this format:

x <- data.frame(Id = rep(c(1,2,3,4),2), 
                Date = c(rep("1980-01-01",4), rep("1980-01-02",4)),
                Freq = c(2,3,1,2,4,5,2,3))

Id      Date       Freq
1    1980-01-01      2
2    1980-01-01      3
3    1980-01-01      1
4    1980-01-01      2
1    1980-01-02      4
2    1980-01-02      5
3    1980-01-02      2
4    1980-01-02      3

My goal is to create a new variable that is simply the previous day's Freq value for the same Id:

Id      Date       Freq   YestFreq
1    1980-01-01      2       NA
2    1980-01-01      3       NA
3    1980-01-01      1       NA
4    1980-01-01      2       NA
1    1980-01-02      4       2
2    1980-01-02      5       3
3    1980-01-02      2       1
4    1980-01-02      3       2

The solution I tried is:

# Key for each row, and the key of the same Id one day earlier
x$DateID = paste(x$Id, x$Date)
x$yesterday = as.Date(x$Date) - 1
x$YesterdayDateID = paste(x$Id, x$yesterday)

result = numeric(nrow(x))
for(i in 1:nrow(x)){
  # look up the row whose key equals this row's "yesterday" key
  answer = x$Freq[which(x$DateID == x$YesterdayDateID[i])]
  if(length(answer) != 0){result[i] = answer} else{result[i] = NA}
}
x = cbind(x, result)

My actual dataset has ~600,000 rows (~300 Ids and ~2,000 unique dates), so the solution above takes 2 hours to run. Any help would be greatly appreciated.

3 answers:

Answer 0 (score: 5)

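This accounts for possible gaps in the dates. I use match to find the row for the previous day, then use that index to subset the target column within each Id: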

data.table

library(data.table)
setDT(x)[, Date := as.IDate(Date)][
, YestFreq := Freq[match(Date-1L, Date)], by=Id][]
#   Id       Date Freq YestFreq
# 1:  1 1980-01-01    2       NA
# 2:  2 1980-01-01    3       NA
# 3:  3 1980-01-01    1       NA
# 4:  4 1980-01-01    2       NA
# 5:  1 1980-01-02    4        2
# 6:  2 1980-01-02    5        3
# 7:  3 1980-01-02    2        1
# 8:  4 1980-01-02    3        2

dplyr

library(dplyr)
x$Date <- as.Date(x$Date)
x %>% group_by(Id) %>% mutate(YestFreq = Freq[match(Date - 1L, Date)])
#   Id       Date Freq YestFreq
# 1  1 1980-01-01    2       NA
# 2  2 1980-01-01    3       NA
# 3  3 1980-01-01    1       NA
# 4  4 1980-01-01    2       NA
# 5  1 1980-01-02    4        2
# 6  2 1980-01-02    5        3
# 7  3 1980-01-02    2        1
# 8  4 1980-01-02    3        2
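
To make the gap handling concrete, here is a minimal base-R sketch of the same match idea. It is not the code from this answer: instead of grouping by Id, it matches on a combined "Id Date" key, which is also a vectorized form of the lookup the question does in a loop. The data frame `g` and the key names are hypothetical, and Id 1 is deliberately missing 1980-01-03.

```r
# Hypothetical data: Id 1 has no row for 1980-01-03
g <- data.frame(
  Id   = c(1, 1, 1, 2, 2),
  Date = as.Date(c("1980-01-01", "1980-01-02", "1980-01-04",
                   "1980-01-01", "1980-01-02")),
  Freq = c(2, 4, 7, 3, 5)
)

# One key per row, and the key of the same Id one day earlier
key_today     <- paste(g$Id, g$Date)
key_yesterday <- paste(g$Id, g$Date - 1)

# match() returns NA when yesterday's key does not exist,
# and indexing Freq with NA yields NA
g$YestFreq <- g$Freq[match(key_yesterday, key_today)]
g$YestFreq
# NA 2 NA NA 3
```

The row for 1980-01-04 gets NA because there is no row for 1980-01-03, which is the behaviour the question asks for when a day is missing.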

Answer 1 (score: 2)

We can try

library(dplyr)
x %>%
  arrange(as.Date(Date), Id) %>%
  group_by(Id) %>%
  mutate(YestFreq = lag(Freq))
#    Id       Date  Freq YestFreq
#  (dbl)     (fctr) (dbl)    (dbl)
#1     1 1980-01-01     2       NA
#2     2 1980-01-01     3       NA
#3     3 1980-01-01     1       NA
#4     4 1980-01-01     2       NA
#5     1 1980-01-02     4        2
#6     2 1980-01-02     5        3
#7     3 1980-01-02     2        1
#8     4 1980-01-02     3        2
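
Note that a lag takes the previous row within each Id, which equals the previous day only when no dates are missing. The base-R sketch below illustrates the difference under that assumption; the data frame `gaps` and the column names `PrevFreq` and `YestFreq` are hypothetical, and `ave()` with `head(v, -1)` stands in for a per-group lag rather than reproducing dplyr's implementation.

```r
# Hypothetical data: Id 1 has no row for 1980-01-03
gaps <- data.frame(
  Id   = c(1, 1, 1, 2, 2),
  Date = as.Date(c("1980-01-01", "1980-01-02", "1980-01-04",
                   "1980-01-01", "1980-01-02")),
  Freq = c(2, 4, 7, 3, 5)
)
gaps <- gaps[order(gaps$Id, gaps$Date), ]

# Per-group lag: shift each Id's values down by one row
lag1 <- function(v) c(NA, head(v, -1))
gaps$PrevFreq <- ave(gaps$Freq, gaps$Id, FUN = lag1)

# Keep the lagged value only if the previous row is exactly one day earlier
prev_day <- ave(as.numeric(gaps$Date), gaps$Id, FUN = lag1)
gaps$YestFreq <- ifelse(as.numeric(gaps$Date) - prev_day == 1,
                        gaps$PrevFreq, NA)

gaps$PrevFreq  # NA 2 4 NA 3  (row 3 takes the value from two days earlier)
gaps$YestFreq  # NA 2 NA NA 3
```

If the real data has a complete calendar for every Id, the plain lag is enough; the extra date check only matters when days can be missing.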

Answer 2 (score: 2)

For a fast solution, use the data.table package, sort the data, and derive a column for each group from the previous row's Freq value:

library(data.table)

x <- data.frame(Id = rep(c(1,2,3,4),2), Date = c(rep("1980-01-01",4), rep("1980-01-02",4)), Freq = c(2,3,1,2,4,5,2,3))

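# The actual solution starts here (it could even be written on one line):
y <- setDT(x)        # convert to data.table by reference
setkey(y, Id, Date)  # sort by Id, then Date
# shift(Freq) lags Freq by one row; unlike c(NA, Freq[1:(.N-1)]),
# it also works when a group has only one row
y[, .(Date, Freq, YestFreq = shift(Freq)), by = .(Id)]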

Output (ordered differently, by Id):

   Id       Date Freq YestFreq
1:  1 1980-01-01    2       NA
2:  1 1980-01-02    4        2
3:  2 1980-01-01    3       NA
4:  2 1980-01-02    5        3
5:  3 1980-01-01    1       NA
6:  3 1980-01-02    2        1
7:  4 1980-01-01    2       NA
8:  4 1980-01-02    3        2

Edit 1:

You can do it in one line (and sort the result as requested):

library(data.table)
x <- data.frame(Id = rep(c(1,2,3,4),2), Date = c(rep("1980-01-01",4), rep("1980-01-02",4)), Freq = c(2,3,1,2,4,5,2,3))

# shift(Freq) rather than c(NA, Freq[1:(.N-1)]), so single-row groups also work
setDT(x, key=c("Id", "Date"))[, YestFreq := shift(Freq), by=Id][order(Date, Id)]

Result:

   Id       Date Freq YestFreq
1:  1 1980-01-01    2       NA
2:  2 1980-01-01    3       NA
3:  3 1980-01-01    1       NA
4:  4 1980-01-01    2       NA
5:  1 1980-01-02    4        2
6:  2 1980-01-02    5        3
7:  3 1980-01-02    2        1
8:  4 1980-01-02    3        2
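
One caveat about lagging by index with an expression such as `Freq[1:(.N-1)]`: it behaves unexpectedly for a group with a single row, because `1:0` counts down and evaluates to `c(1, 0)`. The base-R sketch below shows the effect on a plain vector; the names `v`, `by_index` and `by_head` are only for illustration.

```r
v <- 5            # Freq values of a hypothetical group with a single row
n <- length(v)

# 1:(n - 1) is 1:0, i.e. c(1, 0); index 0 is dropped, so v[1] comes back
by_index <- c(NA, v[1:(n - 1)])

# head(v, -1) drops the last element, leaving nothing for a single value
by_head <- c(NA, head(v, -1))

length(by_index)  # 2: one element too many for a one-row group
length(by_head)   # 1
```

With every Id present on at least two dates, as in the sample data, both forms give the same result.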