Question

I have a data frame with dates that has about one million rows:

  id       date   variable
1  1 2015-01-01         NA
2  1 2015-01-02 -1.1874087
3  1 2015-01-03 -0.5936396
4  1 2015-01-04 -0.6131957
5  1 2015-01-05  1.0291688
6  1 2015-01-06 -1.5810152

A reproducible example is here:

#create example data set
Df <- data.frame(id = factor(rep(1:3, each = 10)), 
     date = rep(seq.Date(from = as.Date('2015-01-01'), 
             to = as.Date('2015-01-10'), by = 1),3),
     variable = rnorm(30))
Df$variable[c(1,7,12,18,22,23,29)] <- NA

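What I want to do is replace the NA values in variable with the value from the previous date, separately for each id. The loop I wrote works but is slow (you can find it below). Could you suggest a fast alternative for this task? Thanks!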

library(dplyr)

#order data frame by date first, so the row numbers below match the sorted rows
Df <- arrange(Df, date)
#create new variable
Df$variableNew <- Df$variable
#create row numbers vector
Df$n <- seq_len(nrow(Df))

for (id in levels(Df$id)){
    I <- Df$n[Df$id == id] # create vector of rows for specific id

    for (row in seq_along(I)){
        # if variable is NA for the first date, change it to the overall mean
        if (is.na(Df$variableNew[I[1]])) {
            Df$variableNew[I[row]] <- mean(Df$variable, na.rm = TRUE)
        }
        # if variable is NA, assign the value from the previous date
        if (is.na(Df$variableNew[I[row]])){
            Df$variableNew[I[row]] <- Df$variableNew[I[row-1]]
        }
    }
}

Answer 1

This data.table solution should be very fast.

library(zoo)         # for na.locf(...)
library(data.table)
setDT(Df)[,variable:=na.locf(variable, na.rm=FALSE),by=id]
Df[,variable:=if (is.na(variable[1])) c(mean(variable,na.rm=TRUE),variable[-1]) else variable,by=id]
Df
#     id       date     variable
#  1:  1 2015-01-01 -0.288720759
#  2:  1 2015-01-02 -0.005344028
#  3:  1 2015-01-03  0.707310667
#  4:  1 2015-01-04  1.034107735
#  5:  1 2015-01-05  0.223480415
#  6:  1 2015-01-06 -0.878707613
#  7:  1 2015-01-07 -0.878707613
#  8:  1 2015-01-08 -2.000164945
#  9:  1 2015-01-09 -0.544790740
# 10:  1 2015-01-10 -0.255670709
# ...

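So this replaces all embedded NAs using locf (last observation carried forward) by id, and then makes a second pass that replaces any leading NA with the mean of variable for that id. Note that if you do this in the reverse order, you may get a different answer, because after the locf pass the mean also includes the carried-forward values.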
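To see why the order of the two passes matters, here is a minimal base R sketch on a hand-made vector. The helpers `locf` and `fill_first` are hypothetical names written for this illustration only; they are not part of zoo or data.table, and `locf` is just a rough stand-in for `na.locf(x, na.rm = FALSE)`.

```r
# Hypothetical helper: last observation carried forward, base R only.
locf <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # position of the latest non-NA value
  idx[idx == 0] <- NA                        # leading NAs have nothing to carry
  x[idx]
}

# Hypothetical helper: replace a leading NA with the mean of the vector.
fill_first <- function(x) {
  if (is.na(x[1])) x[1] <- mean(x, na.rm = TRUE)
  x
}

x <- c(NA, 1, NA, 3)

# Order used in the answer: carry forward first, then fix the leading NA.
# The mean is taken over c(1, 1, 3), so the first value becomes 5/3.
locf_then_mean <- fill_first(locf(x))

# Reverse order: the mean is taken over c(1, 3), so the first value becomes 2.
mean_then_locf <- locf(fill_first(x))
```

The two results differ only in the first element, which is the difference the answer warns about.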

Answer 2

If you install the development version of tidyr (0.3.0), available on GitHub, there is a function fill that does exactly this:

#devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
Df %>% group_by(id) %>% 
       fill(variable)

It won't fill the first value of a group when that value is NA; we can handle that case first with mutate and replace:

Df %>% group_by(id) %>%
       mutate(variable = ifelse(is.na(variable) & row_number()==1, 
                                replace(variable, 1, mean(variable, na.rm = TRUE)),
                                variable)) %>% 
       fill(variable)
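For readers who want to check the logic without tidyr, the same two steps can be sketched in base R with `ave()`. This is only an illustration under stated assumptions: the toy data and the helper `fill_group` are hypothetical, and the rows are assumed to be already sorted by date within each id.

```r
# Toy data in the same shape as Df (values chosen by hand for illustration).
toy <- data.frame(id = factor(c(1, 1, 1, 2, 2, 2)),
                  variable = c(NA, 2, NA, 5, NA, 7))

# Hypothetical helper: fill a leading NA with the group mean, then carry the
# last observation forward (same order as the mutate + fill pipeline above).
fill_group <- function(x) {
  if (is.na(x[1])) x[1] <- mean(x, na.rm = TRUE)
  idx <- cummax(seq_along(x) * (!is.na(x)))  # position of the latest non-NA value
  idx[idx == 0] <- NA                        # only reached if the whole group is NA
  x[idx]
}

# ave() applies fill_group within each id and returns values in the original row order.
toy$variable <- ave(toy$variable, toy$id, FUN = fill_group)
toy$variable
```

In the first group the leading NA becomes the group mean (2) and the trailing NA is carried forward; in the second group only the middle NA is filled.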

Replace NA with the previous date's value
