Adding missing date values to a data frame with multiple observation periods

Time: 2014-06-11 06:05:19

Tags: r date for-loop merge sequence

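Thanks in advance.

I am trying to add the missing date values that fall inside the observation period of each of three different individuals.

My data look like this: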

 IndID       Date Event Number Percent
1   P01 2011-03-04     1      2   0.390
2   P01 2011-03-11     1      2   0.975
3   P01 2011-03-13     0      9   0.795
4   P01 2011-03-14     0     10   0.516
5   P01 2011-03-15     0      1   0.117
6   P01 2011-03-17     0      7   0.093

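IndID is the individual ID (P01, P03, P06). Date is obviously the date. Event is a binary variable indicating whether an event occurred (0 = no, 1 = yes).
Number and Percent are not directly relevant, but they need to be kept, so they are included here.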

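My sample data frame (PostData) is included below, using dput.

For each IndID, the first and last Date are the start and end of the observation period, and some dates in between are missing. My goal is to add the missing dates for each individual and fill in 0 in the Event column. The other columns (Number, Percent) can stay blank.

This post was useful, but it does not cover my main problem: multiple individuals.

Each individual's observation period runs from min(PostData$Date) to max(PostData$Date), computed over that individual's rows. I have been trying to create a complete date sequence for each individual and then merge it with the existing data frame inside a for loop. There must be a better way.

Any suggestions are appreciated.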

PostData <- structure(list(IndID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
  3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
  5L, 5L), .Label = c("P01", "P02", "P03", "P05", "P06", "P07", 
  "P08", "P09", "P10", "P11", "P12", "P13"), class = "factor"), 
  Date = structure(c(1299196800, 1299801600, 1299974400, 1300060800, 
  1300147200, 1300320000, 1300406400, 1310083200, 1310169600, 
  1310515200, 1310774400, 1310947200, 1311033600, 1311292800, 
  1311552000, 1323129600, 1323388800, 1323648000, 1323993600, 
  1324080000, 1324166400, 1324339200, 1327622400, 1327795200, 
  1327881600), class = c("POSIXct", "POSIXt"), tzone = "GMT"), 
  Event = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 
  0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L), Number = c(2L, 
  2L, 9L, 10L, 1L, 7L, 5L, 9L, 1L, 4L, 5L, 2L, 0L, 1L, 10L, 
  5L, 0L, 6L, 5L, 10L, 9L, 4L, 4L, 8L, 1L), Percent = c(0.39, 
  0.975, 0.795, 0.516, 0.117, 0.093, 0.528, 0.659, 0.308, 0.055, 
  0.185, 0.761, 0.132, 0.676, 0.368, 0.383, 0.272, 0.113, 0.974, 
  0.696, 0.941, 0.751, 0.758, 0.29, 0.15)), .Names = c("IndID", 
  "Date", "Event", "Number", "Percent"), row.names = c(NA, 25L), 
  class = "data.frame")

4 answers:

Answer 0: (score: 5)

A base R version:

do.call(rbind,
  by(
    PostData,
    PostData$IndID,
    function(x) {
      out <- merge(
        data.frame(
          IndID=x$IndID[1],
          Date=seq.POSIXt(min(x$Date),max(x$Date),by="1 day")
        ),
        x,
        all.x=TRUE
      )
      out$Event[is.na(out$Event)] <- 0
      out
    }  
  )
)

Result:

       IndID       Date Event Number Percent
P01.1    P01 2011-03-04     1      2   0.390
P01.2    P01 2011-03-05     0     NA      NA
P01.3    P01 2011-03-06     0     NA      NA
P01.4    P01 2011-03-07     0     NA      NA
P01.5    P01 2011-03-08     0     NA      NA
P01.6    P01 2011-03-09     0     NA      NA
P01.7    P01 2011-03-10     0     NA      NA
P01.8    P01 2011-03-11     1      2   0.975
<<etc>>
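
The key step in this answer is the merge with all.x=TRUE against a complete daily calendar. A minimal sketch of that step for a single individual, using hypothetical toy data (obs, full, and out are names introduced here for illustration, not part of the answer):

```r
# Toy data (hypothetical): one individual observed on 3 of 5 days.
obs <- data.frame(
  IndID = "P01",
  Date  = as.Date(c("2011-03-04", "2011-03-06", "2011-03-08")),
  Event = c(1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Complete daily calendar spanning the observation period.
full <- data.frame(
  IndID = "P01",
  Date  = seq(min(obs$Date), max(obs$Date), by = "day"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every calendar row; days without an observation get NA.
out <- merge(full, obs, all.x = TRUE)

# Days added by the calendar had no event.
out$Event[is.na(out$Event)] <- 0L
out
```

The by() call in the answer applies this same step once per IndID, and do.call(rbind, ...) stacks the pieces.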

Answer 1: (score: 3)

Try this. It adds the missing dates with the correct ID and sets the remaining fields to 0.

library(data.table)
library(plyr)
dtPostData = data.table(PostData)
minmaxTab = dtPostData[,list(minDate=min(Date),maxDate=max(Date)),by=IndID]

df = lapply(1:nrow(minmaxTab),function(x) {
  temp = seq(minmaxTab$minDate[x],minmaxTab$maxDate[x],by=24*60*60) 
  temp = temp[!(temp %in% dtPostData[IndID == minmaxTab$IndID[x],]$Date)]
  data.table(IndID = minmaxTab$IndID[x], Date = temp, Event = 0, Number = 0, Percent = 0)
})

df <- ldply(df, data.frame)  # bind the per-individual tables into one data frame
df

#Results
   IndID       Date Event Number Percent
1    P01 2011-03-05     0      0       0
2    P01 2011-03-06     0      0       0
3    P01 2011-03-07     0      0       0
4    P01 2011-03-08     0      0       0
5    P01 2011-03-09     0      0       0
6    P01 2011-03-10     0      0       0
7    P01 2011-03-12     0      0       0
8    P01 2011-03-16     0      0       0
9    P03 2011-07-10     0      0       0
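
The output above contains only the rows for the missing dates. If the full filled-in table is wanted, those rows can be appended to the observed rows and re-sorted. A rough sketch on hypothetical toy data (observed, missing, and combined are illustrative names, not objects from the answer):

```r
# Toy stand-ins (hypothetical) for the observed rows and the generated rows.
observed <- data.frame(
  IndID   = c("P01", "P01"),
  Date    = as.Date(c("2011-03-04", "2011-03-06")),
  Event   = c(1, 0),
  Number  = c(2, 9),
  Percent = c(0.390, 0.795),
  stringsAsFactors = FALSE
)
missing <- data.frame(
  IndID   = "P01",
  Date    = as.Date("2011-03-05"),
  Event   = 0,
  Number  = 0,
  Percent = 0,
  stringsAsFactors = FALSE
)

# Stack both tables, then order by individual and date.
combined <- rbind(observed, missing)
combined <- combined[order(combined$IndID, combined$Date), ]
rownames(combined) <- NULL
combined
```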

Answer 2: (score: 2)

Here is a dplyr solution. The result, based on the sample data, is a data.frame with 89 rows, which I hope is what you want.

require(dplyr)

PostData %>%
  mutate(Date = as.Date(as.character(Date))) %>%
  group_by(IndID) %>%
  do(left_join(data.frame(IndID = .$IndID[1], Date = seq(min(.$Date), max(.$Date), 1)), ., 
                       by=c("IndID", "Date"))) %>%
  mutate(Event = ifelse(is.na(Event), 0, Event))

#   IndID       Date Event Number Percent
#1    P01 2011-03-04     1      2   0.390
#2    P01 2011-03-05     0     NA      NA
#3    P01 2011-03-06     0     NA      NA
#4    P01 2011-03-07     0     NA      NA
#5    P01 2011-03-08     0     NA      NA
#6    P01 2011-03-09     0     NA      NA 
#7    P01 2011-03-10     0     NA      NA
#8    P01 2011-03-11     1      2   0.975
#...
#84   P06 2012-01-25     0     NA      NA
#85   P06 2012-01-26     0     NA      NA
#86   P06 2012-01-27     1      4   0.758
#87   P06 2012-01-28     0     NA      NA
#88   P06 2012-01-29     0      8   0.290
#89   P06 2012-01-30     0      1   0.150
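
The 89-row figure can be checked against the per-individual date ranges. A small sketch, with the first and last dates read off the dput in the question (the ranges data frame is a helper introduced here for illustration):

```r
# First and last observed date per individual, taken from the sample data.
ranges <- data.frame(
  IndID = c("P01", "P03", "P06"),
  first = as.Date(c("2011-03-04", "2011-07-08", "2011-12-06")),
  last  = as.Date(c("2011-03-18", "2011-07-25", "2012-01-30")),
  stringsAsFactors = FALSE
)

# Number of calendar days in each period, counting both endpoints.
ranges$n_days <- as.integer(ranges$last - ranges$first) + 1L
ranges$n_days       # 15 18 56
sum(ranges$n_days)  # 89
```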

Answer 3: (score: 0)

Compute the overall minimum and maximum times across all individuals (seconds since the epoch):

min_time = as.integer(min(PostData$Date))
max_time = as.integer(max(PostData$Date))

Build the list of dates using a sequence:

list_of_dates = seq(min_time, max_time, 86400) # one step per day: 86400 seconds in a day
list_of_dates = as.Date(as.POSIXct(list_of_dates, origin = "1970-01-01", tz = "GMT"))
# convert back to a date
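
Stepping by 86400 seconds matches stepping by calendar day here because the sample data are stored in GMT, which has no daylight saving shifts. A short sketch of that equivalence on hypothetical start and end times (start, end, by_seconds, and by_day are illustrative names):

```r
# Hypothetical endpoints, in the same time zone as the sample data.
start <- as.POSIXct("2011-03-04", tz = "GMT")
end   <- as.POSIXct("2011-03-08", tz = "GMT")

# Fixed-length steps of 86400 seconds versus calendar-day steps.
by_seconds <- seq(start, end, by = 86400)
by_day     <- seq(start, end, by = "day")

all(by_seconds == by_day)  # TRUE in GMT
as.Date(by_seconds)        # five consecutive dates, 2011-03-04 to 2011-03-08
```

In a time zone with daylight saving time, the fixed 86400-second step can drift away from midnight, so by = "day" is the safer choice there.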

Build the list of missing IndID and date combinations:

temp = merge(unique(PostData$IndID),list_of_dates)
names(temp) = c("IndID","Date")
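# match on the (IndID, Date) pair rather than on each column separately
observed_keys = paste(PostData$IndID, as.Date(PostData$Date))
data_missing_indID_date = temp[!(paste(temp$IndID, temp$Date) %in% observed_keys), ]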

Build the remaining columns:

data_missing_indID_date$Event = 0 
data_missing_indID_date$Number = NA
data_missing_indID_date$Percent = NA

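rbind it to the original data frame:

PostData$Date = as.Date(PostData$Date) # use the same Date class in both data frames
final_data = rbind(PostData, data_missing_indID_date)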