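Thanks in advance.
I am trying to add the date values that are missing from the observation period for three different individuals.
My data look like this: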
IndID Date Event Number Percent
1 P01 2011-03-04 1 2 0.390
2 P01 2011-03-11 1 2 0.975
3 P01 2011-03-13 0 9 0.795
4 P01 2011-03-14 0 10 0.516
5 P01 2011-03-15 0 1 0.117
6 P01 2011-03-17 0 7 0.093
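IndID is the individual ID (P01, P03, P06). Date is obviously the date. Event is a binary variable indicating whether an event occurred (0 = no, 1 = yes). The columns Number and Percent are not directly relevant, but they need to be retained, so they are included here.
My sample data frame (PostData) is included below using dput.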
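For each IndID, the first and last Date mark the start and end of the observation period, and some dates in between are missing. My goal is to add the missing dates for each individual and put 0 in the Event column. The other columns (Number and Percent) can stay blank.
This post has been helpful, but it does not cover my main problem: multiple individuals.
The observation period for each individual runs from min(PostData$Date) to max(PostData$Date). I have been trying to create a complete sequence of dates for each individual and then merge it with the existing data frame in a for loop. There must be a better idea.
Any suggestions are appreciated.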
PostData <-structure(list(IndID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L), .Label = c("P01", "P02", "P03", "P05", "P06", "P07",
"P08", "P09", "P10", "P11", "P12", "P13"), class = "factor"),
Date = structure(c(1299196800, 1299801600, 1299974400, 1300060800,
1300147200, 1300320000, 1300406400, 1310083200, 1310169600,
1310515200, 1310774400, 1310947200, 1311033600, 1311292800,
1311552000, 1323129600, 1323388800, 1323648000, 1323993600,
1324080000, 1324166400, 1324339200, 1327622400, 1327795200,
1327881600), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
Event = c(1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L), Number = c(2L,
2L, 9L, 10L, 1L, 7L, 5L, 9L, 1L, 4L, 5L, 2L, 0L, 1L, 10L,
5L, 0L, 6L, 5L, 10L, 9L, 4L, 4L, 8L, 1L), Percent = c(0.39,
0.975, 0.795, 0.516, 0.117, 0.093, 0.528, 0.659, 0.308, 0.055,
0.185, 0.761, 0.132, 0.676, 0.368, 0.383, 0.272, 0.113, 0.974,
0.696, 0.941, 0.751, 0.758, 0.29, 0.15)), .Names = c("IndID",
"Date", "Event", "Number", "Percent"), row.names = c(NA, 25L),
class = "data.frame")
Answer 0 (score: 5)
Base R version:
do.call(rbind,
  by(
    PostData,
    PostData$IndID,
    function(x) {
      out <- merge(
        data.frame(
          IndID=x$IndID[1],
          Date=seq.POSIXt(min(x$Date),max(x$Date),by="1 day")
        ),
        x,
        all.x=TRUE
      )
      out$Event[is.na(out$Event)] <- 0
      out
    }
  )
)
Result:
IndID Date Event Number Percent
P01.1 P01 2011-03-04 1 2 0.390
P01.2 P01 2011-03-05 0 NA NA
P01.3 P01 2011-03-06 0 NA NA
P01.4 P01 2011-03-07 0 NA NA
P01.5 P01 2011-03-08 0 NA NA
P01.6 P01 2011-03-09 0 NA NA
P01.7 P01 2011-03-10 0 NA NA
P01.8 P01 2011-03-11 1 2 0.975
<<etc>>
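The same split, merge, and recombine idea can be tried on a small toy data frame to check that no gaps remain. This is only a sketch: the `toy` data and the helper name `fill_days` are made up for illustration and are not part of the answer above, and it uses `Date` rather than `POSIXct` values.

```r
# Toy data (hypothetical values, not the poster's PostData).
toy <- data.frame(
  IndID = c("A", "A", "B", "B"),
  Date  = as.Date(c("2020-01-01", "2020-01-04", "2020-02-10", "2020-02-12")),
  Event = c(1L, 0L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pad one individual's rows to a full daily sequence.
fill_days <- function(x) {
  full <- data.frame(
    IndID = x$IndID[1],
    Date  = seq(min(x$Date), max(x$Date), by = "day"),
    stringsAsFactors = FALSE
  )
  # Left join: keep every day, pull in the observed rows where they exist.
  out <- merge(full, x, by = c("IndID", "Date"), all.x = TRUE)
  out$Event[is.na(out$Event)] <- 0L
  out
}

filled <- do.call(rbind, lapply(split(toy, toy$IndID), fill_days))

# TRUE for an individual when consecutive dates are exactly one day apart.
gaps_ok <- tapply(filled$Date, filled$IndID,
                  function(d) all(diff(sort(d)) == 1))
```

Here `split()` plus `lapply()` plays the role of `by()`. Either one should work, since both hand the function one individual's rows at a time.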
Answer 1 (score: 3)
Try this. It builds rows for the missing dates with the correct ID and sets the remaining fields to 0.
library(data.table)
library(plyr)
dtPostData = data.table(PostData)
minmaxTab = dtPostData[,list(minDate=min(Date),maxDate=max(Date)),by=IndID]
df = lapply(1:nrow(minmaxTab),function(x) {
temp = seq(minmaxTab$minDate[x],minmaxTab$maxDate[x],by=24*60*60)
temp = temp[!(temp %in% dtPostData[IndID == minmaxTab$IndID[x],]$Date)]
data.table(IndID = minmaxTab$IndID[x], Date = temp, Event = 0, Number = 0, Percent = 0)
})
df <- ldply(df, data.frame)
df
#Results
IndID Date Event Number Percent
1 P01 2011-03-05 0 0 0
2 P01 2011-03-06 0 0 0
3 P01 2011-03-07 0 0 0
4 P01 2011-03-08 0 0 0
5 P01 2011-03-09 0 0 0
6 P01 2011-03-10 0 0 0
7 P01 2011-03-12 0 0 0
8 P01 2011-03-16 0 0 0
9 P03 2011-07-10 0 0 0
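The result shown above contains only the rows for the missing dates. If the full table is wanted, one possible follow-up is to append those rows to the original data and sort by individual and date. The sketch below uses base R and made-up toy values, and leaves `Number` and `Percent` as `NA` (an assumption based on the question asking for those columns to stay blank).

```r
# Toy stand-ins (hypothetical values) for the original rows and the missing rows.
toy <- data.frame(
  IndID   = c("A", "A"),
  Date    = as.Date(c("2020-01-01", "2020-01-04")),
  Event   = c(1, 0),
  Number  = c(2, 9),
  Percent = c(0.39, 0.795),
  stringsAsFactors = FALSE
)
missing_rows <- data.frame(
  IndID   = "A",
  Date    = as.Date(c("2020-01-02", "2020-01-03")),
  Event   = 0,
  Number  = NA,
  Percent = NA,
  stringsAsFactors = FALSE
)

# Append, then restore chronological order within each individual.
combined <- rbind(toy, missing_rows)
combined <- combined[order(combined$IndID, combined$Date), ]
rownames(combined) <- NULL
```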
Answer 2 (score: 2)
Here's a dplyr solution. Based on the sample data, the result is a data.frame with 89 rows, which I hope is what you wanted.
require(dplyr)
PostData %>%
mutate(Date = as.Date(as.character(Date))) %>%
group_by(IndID) %>%
do(left_join(data.frame(IndID = .$IndID[1], Date = seq(min(.$Date), max(.$Date), 1)), .,
by=c("IndID", "Date"))) %>%
mutate(Event = ifelse(is.na(Event), 0, Event))
# IndID Date Event Number Percent
#1 P01 2011-03-04 1 2 0.390
#2 P01 2011-03-05 0 NA NA
#3 P01 2011-03-06 0 NA NA
#4 P01 2011-03-07 0 NA NA
#5 P01 2011-03-08 0 NA NA
#6 P01 2011-03-09 0 NA NA
#7 P01 2011-03-10 0 NA NA
#8 P01 2011-03-11 1 2 0.975
#...
#84 P06 2012-01-25 0 NA NA
#85 P06 2012-01-26 0 NA NA
#86 P06 2012-01-27 1 4 0.758
#87 P06 2012-01-28 0 NA NA
#88 P06 2012-01-29 0 8 0.290
#89 P06 2012-01-30 0 1 0.150
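A possibly shorter variant of the same per-group idea uses `complete()` from the tidyr package, which is not used in the answer above. Treat this as a sketch on toy data: it assumes a tidyr version in which `complete()` evaluates the sequence within each group of a grouped data frame.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values, not the poster's PostData).
toy <- data.frame(
  IndID = c("A", "A", "B", "B"),
  Date  = as.Date(c("2020-01-01", "2020-01-04", "2020-02-10", "2020-02-12")),
  Event = c(1, 0, 1, 1)
)

filled <- toy %>%
  group_by(IndID) %>%
  # One row per day between each individual's first and last date;
  # new rows get Event = 0, other columns would be left as NA.
  complete(Date = seq(min(Date), max(Date), by = "day"),
           fill = list(Event = 0)) %>%
  ungroup()
```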
Answer 3 (score: 0)
Compute the minimum and maximum time (in seconds since the epoch):
min_time = as.integer(min(PostData$Date))
max_time = as.integer(max(PostData$Date))
Use a sequence to build the list of dates:
list_of_dates = seq(min_time,max_time, 86400) #since there are 86400 seconds in a day
list_of_dates = as.Date(as.POSIXct(list_of_dates, origin = '1970-01-01', tz = 'GMT'))
#convert back to a date
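As a quick sanity check of the seconds-to-date conversion, the first `Date` value from the question's dput output (1299196800) should come back as 2011-03-04. The snippet is only illustrative; the variable names are made up.

```r
# First Date value in the question's dput output, in seconds since the epoch.
secs <- 1299196800
d <- as.Date(as.POSIXct(secs, origin = "1970-01-01", tz = "GMT"))

# A daily sequence can also be built directly on Date objects,
# which avoids the manual 86400-second step.
days <- seq(as.Date("2011-03-04"), as.Date("2011-03-07"), by = "day")
```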
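Build the list of missing IndID and date combinations:
temp = merge(unique(PostData$IndID),list_of_dates)
names(temp) = c("IndID","Date")
#compare IndID/Date pairs; PostData$Date is POSIXct, so convert it to Date first
existing = paste(PostData$IndID, as.Date(PostData$Date))
data_missing_indID_date = temp[!(paste(temp$IndID, temp$Date) %in% existing),]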
Build the remaining columns:
data_missing_indID_date$Event = 0
data_missing_indID_date$Number = NA
data_missing_indID_date$Percent = NA
rbind it onto the original data frame:
final_data = rbind(PostData, data_missing_indID_date)
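Finding the missing combinations comes down to comparing IndID/Date pairs rather than each column separately. A minimal sketch on toy data follows; the names `have`, `grid`, `pair_key`, and `absent` are made up for illustration. Like this answer, it spans the overall date range for every individual rather than each individual's own range.

```r
# Toy observed rows (hypothetical values).
have <- data.frame(
  IndID = c("A", "A", "B"),
  Date  = as.Date(c("2020-01-01", "2020-01-03", "2020-01-02")),
  stringsAsFactors = FALSE
)

# Every IndID x Date combination over the overall date range.
grid <- expand.grid(
  IndID = unique(have$IndID),
  Date  = seq(min(have$Date), max(have$Date), by = "day"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one string key per IndID/Date pair.
pair_key <- function(d) paste(d$IndID, d$Date)

# Keep the combinations that were never observed. Negate the logical vector
# itself: !which(...) would negate row positions and select nothing.
absent <- grid[!(pair_key(grid) %in% pair_key(have)), ]
```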