数据框内的预测

时间:2016-06-06 12:40:20

标签: r dataframe dplyr

我试图想出一种在数据框中进行投影的方法,最好是dplyr

library("dplyr")

set.seed(1)
df0 <- data_frame(t = 0:5, 
                  r = c(NA, rnorm(n = 5, mean = 1, sd = 0.1)), 
                  P = c(100, rep(x = NA, times = 5)))
df0
# Source: local data frame [6 x 3]
# 
#       t         r     P
#   (int)     (dbl) (dbl)
# 1     0        NA   100
# 2     1 0.9373546    NA
# 3     2 1.0183643    NA
# 4     3 0.9164371    NA
# 5     4 1.1595281    NA
# 6     5 1.0329508    NA

我对如何以递归方式运行投影模型感到困惑......

df0 %>%
  mutate(P = ifelse(test = is.na(P), yes = lag(P)*r, no = P))
# Source: local data frame [6 x 3]
# 
#       t         r         P
#   (int)     (dbl)     (dbl)
# 1     0        NA 100.00000
# 2     1 0.9373546  93.73546
# 3     2 1.0183643        NA
# 4     3 0.9164371        NA
# 5     4 1.1595281        NA
# 6     5 1.0329508        NA

有人知道这是否可行?

我想到的是使用group_by在多个地区执行此操作。数据框将非常大,因此优先考虑快速解决除data.frame类型对象之外的其他问题。

到目前为止,我能想到的唯一解决方案是使用for循环...

for(i in 1:5)
  df0 <- df0 %>% mutate(P = ifelse(is.na(P), yes = lag(P)*r, no = P))
df0
# Source: local data frame [6 x 3]
# 
#       t         r         P
#   (int)     (dbl)     (dbl)
# 1     0        NA 100.00000
# 2     1 0.9373546  93.73546
# 3     2 1.0183643  95.45685
# 4     3 0.9164371  87.48020
# 5     4 1.1595281 101.43575
# 6     5 1.0329508 104.77814

...这可能导致我的大数据集出现内存问题,并且考虑到我在R中读到的for循环,可能不是最好的解决方案。

修改

使用purrrsimulations非常相似的问题的一些不错的答案。写在博客post中。

1 个答案:

答案 0 :(得分:1)

因为您提到了快速解决方案,所以可能是data.table解决方案?

DT <- data.table(df0)
for(i in 1:nrow(DT)) 
    set(DT, j = 3L, value = ifelse(is.na(DT$P), yes = lag(DT$P)*DT$r, no = DT$P))   
DT
 t         r         P
1: 0        NA 100.00000
2: 1 0.9373546  93.73546
3: 2 1.0183643  95.45685
4: 3 0.9164371  87.48020
5: 4 1.1595281 101.43575
6: 5 1.0329508 104.77814

速度比较......

f_dt <- 
  function(){
    for (i in 1:nrow(DT))
     set(DT, j = 3L, value = ifelse(is.na(DT[,P]), yes = lag(DT$P)*DT$r, no = DT$P)) 
    DT
  }

f_dplyr <- 
  function(){
    for (i in 1:nrow(df0))
      df0 <- mutate(df0, P = ifelse(is.na(P), yes = lag(P)*r, no = P))
    df0
  }
f_cumprod <- 
  function(){
    res <- c(df0$P[1],df0$P[1]*cumprod(df0$r[-1]))
    res
  }

library(microbenchmark)

microbenchmark(f_dt(),f_dplyr(),f_cumprod(),times = 100)

Unit: microseconds # only 500 rows
        expr        min         lq         mean      median         uq        max neval
      f_dt() 178350.056 186226.605 192842.91784 190115.9120 195791.748 272405.911   100
   f_dplyr() 307450.092 323326.566 331586.39073 328444.5255 335888.287 387716.640   100
 f_cumprod()     27.798     34.213     45.11819     43.4075     52.175     75.268   100