I have a data frame with dates that has about a million rows:
id date variable
1 1 2015-01-01 NA
2 1 2015-01-02 -1.1874087
3 1 2015-01-03 -0.5936396
4 1 2015-01-04 -0.6131957
5 1 2015-01-05 1.0291688
6 1 2015-01-06 -1.5810152
A reproducible example is here:
#create example data set
Df <- data.frame(id = factor(rep(1:3, each = 10)),
date = rep(seq.Date(from = as.Date('2015-01-01'),
to = as.Date('2015-01-10'), by = 1),3),
variable = rnorm(30))
Df$variable[c(1,7,12,18,22,23,29)] <- NA
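What I want to do is replace the NA values in variable with the value from the previous date, for each id. The loop I wrote works but is slow (you can find it below). Can you suggest a fast alternative for this task? Thanks!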
library(dplyr)
#create new variable
Df$variableNew <- Df$variable
#order data frame by date
Df <- arrange(Df, date)
#create row numbers vector (after sorting, so the numbers match row positions)
Df$n <- seq_len(nrow(Df))
for (id in levels(Df$id)){
I <- Df$n[Df$id == id] # create vector of rows for specific id
for (row in seq_along(I)){
if (is.na(Df$variableNew[I[1]])) { # if variable is NA on the first date, set it to the overall mean
Df$variableNew[I[row]] <- mean(Df$variable,na.rm = TRUE)
}
if (is.na(Df$variableNew[I[row]])){ # if variable is NA, assign this date the value from the previous date
Df$variableNew[I[row]] <- Df$variableNew[I[row-1]]
}
}
}
Answer 0 (score: 3)
This data.table solution should be very fast.
library(zoo) # for na.locf(...)
library(data.table)
setDT(Df)[,variable:=na.locf(variable, na.rm=FALSE),by=id]
Df[,variable:=if (is.na(variable[1])) c(mean(variable,na.rm=TRUE),variable[-1]) else variable,by=id]
Df
# id date variable
# 1: 1 2015-01-01 -0.288720759
# 2: 1 2015-01-02 -0.005344028
# 3: 1 2015-01-03 0.707310667
# 4: 1 2015-01-04 1.034107735
# 5: 1 2015-01-05 0.223480415
# 6: 1 2015-01-06 -0.878707613
# 7: 1 2015-01-07 -0.878707613
# 8: 1 2015-01-08 -2.000164945
# 9: 1 2015-01-09 -0.544790740
# 10: 1 2015-01-10 -0.255670709
# ...
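So this replaces all embedded NAs in variable using locf within each id, and then makes a second pass that replaces a leading NA with the mean of variable for that id. Note that if you do this in the reverse order, you may get a different answer.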
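To illustrate why the order of the two passes matters, here is a minimal sketch on a single made-up vector (the vector x and the names a and b are illustrative assumptions, not part of the original data). It applies the same two steps in both orders:

```r
library(zoo)  # for na.locf(), as in the answer above

# Hypothetical toy vector: a leading NA and an embedded NA
x <- c(NA, 1, NA, 3)

# Order used in the answer: carry values forward first, then fill the leading NA
a <- na.locf(x, na.rm = FALSE)                   # NA 1 1 3
if (is.na(a[1])) a[1] <- mean(a, na.rm = TRUE)   # mean is taken over 1, 1, 3

# Reverse order: fill the leading NA first, then carry values forward
b <- x
if (is.na(b[1])) b[1] <- mean(b, na.rm = TRUE)   # mean is taken over 1, 3
b <- na.locf(b, na.rm = FALSE)

a
b
```

The first elements differ because, in the answer's order, the mean is computed after the embedded NAs have already been filled, so carried-forward values are counted again.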
Answer 1 (score: 0)
If you get the development version of tidyr (0.3.0), available on GitHub, there is a function fill that does exactly this:
#devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
Df %>% group_by(id) %>%
fill(variable)
It won't handle the first value; we can take care of that with mutate and replace:
Df %>% group_by(id) %>%
mutate(variable = ifelse(is.na(variable) & row_number()==1,
replace(variable, 1, mean(variable, na.rm = TRUE)),
variable)) %>%
fill(variable)