I have two large datasets like these:
df1 <- data.frame(subject = c(rep(1, 15), rep(2, 14)), day =c(0,0,1,1,1,2,3,15,15,16,16,17,17,18,19,0,0,1,1,2,3,15,15,16,16,17,17,18,19),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/17/2012 7:22','4/17/2012 7:45','4/17/2012 8:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/2/2012 6:28','5/2/2012 7:00','5/3/2012 6:22','5/3/2012 7:00','5/4/2012 6:26','5/5/2012 6:47','4/23/2012 5:56','4/23/2012 6:30','4/24/2012 6:55','4/24/2012 7:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/9/2012 5:55','5/9/2012 6:30','5/10/2012 5:55','5/10/2012 6:30','5/11/2012 6:41','5/12/2012 6:46'))
df2 <- data.frame(subject = c(rep(1, 10), rep(2, 10)), day =c(1,1,2,3,9,12,15,15,16,17,1,1,2,3,9,13,15,15,16,17),dtime=c('4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/25/2012 7:15','4/28/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','5/2/2012 7:00','5/6/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45'))
...
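I want to merge the two datasets so that 'dtime' from df2 is matched to df1 on 'subject' and 'day', with missing values filled with '.'. The result should have the same number of rows as df1. The expected output should look like this: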
merged <- data.frame(subject = c(rep(1, 15), rep(2, 14)), day =c(0,0,1,1,1,2,3,15,15,16,16,17,17,18,19,0,0,1,1,2,3,15,15,16,16,17,17,18,19),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/17/2012 7:22','4/17/2012 7:45','4/17/2012 8:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/2/2012 6:28','5/2/2012 7:00','5/3/2012 6:22','5/3/2012 7:00','5/4/2012 6:26','5/5/2012 6:47','4/23/2012 5:56','4/23/2012 6:30','4/24/2012 6:55','4/24/2012 7:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/9/2012 5:55','5/9/2012 6:30','5/10/2012 5:55','5/10/2012 6:30','5/11/2012 6:41','5/12/2012 6:46'),dtime =c('.','.','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','.','.','.','.','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','.','.'))
...
I tried merge(df1, df2, by = c('subject', 'day')), but it did not work well: it produced extra rows that I don't want.
Does anyone know how to achieve this?
Answer 0: (score: 2)
This seems to work.
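# Left join: keep every row of df1; drop duplicated df2 rows first so no df1 row is matched twice
result <- merge(df1, unique(df2), by = c("subject", "day"), all.x = TRUE)
# dtime may be a factor, so convert it to character before filling in "."
result$dtime <- as.character(result$dtime)
# Replace the NAs left by unmatched rows with "."
result$dtime[is.na(result$dtime)] <- "."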
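To see why the unique(...) call matters, here is a small sketch on toy data. The names left and right and their values are hypothetical stand-ins, not the data from the question.

```r
# Toy data (hypothetical): 'right' contains the same (subject, day) row twice
left  <- data.frame(subject = c(1, 1, 1), day = c(0, 1, 1),
                    stime = c("a", "b", "c"), stringsAsFactors = FALSE)
right <- data.frame(subject = c(1, 1), day = c(1, 1),
                    dtime = c("x", "x"), stringsAsFactors = FALSE)

# Duplicated keys on the right multiply the matching rows on the left
n_dup <- nrow(merge(left, right, by = c("subject", "day"), all.x = TRUE))

# De-duplicating first gives one output row per row of 'left'
n_unique <- nrow(merge(left, unique(right), by = c("subject", "day"), all.x = TRUE))

print(c(n_dup = n_dup, n_unique = n_unique))
```

Comparing nrow(result) with nrow(df1) after the merge is a cheap check that no extra rows crept in.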
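Some notes:
1. You don't really need the by=... argument in merge(...), because the default is to merge on all common columns (in your case, subject and day). I included it for clarity.
2. Some rows in df2 are duplicated. In this case we can deal with that using unique(...), but duplicates are usually a symptom of a bigger problem. You should really look into why there are duplicated rows.
3. dtime is a factor (the default for strings in data.frame() before R 4.0.0). You have to convert it to character before setting the NAs to something else.
Finally, if your datasets really are large (millions of rows), consider using data tables (the data.table package). That will be faster.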
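For reference, the same left join could look roughly like this with data.table. This is a minimal sketch that assumes the data.table package is installed; left and right are hypothetical toy tables rather than the data from the question.

```r
library(data.table)

# Toy data (hypothetical): 'right' contains a duplicated row
left  <- data.table(subject = c(1, 1, 1), day = c(0, 1, 1),
                    stime = c("a", "b", "c"))
right <- data.table(subject = c(1, 1), day = c(1, 1),
                    dtime = c("x", "x"))

# X[Y, on = ...] looks up each row of Y (here 'left') in X (here the
# de-duplicated 'right'); unmatched rows get NA in dtime
res <- unique(right)[left, on = c("subject", "day")]

# Fill the unmatched rows with "." by reference
res[is.na(dtime), dtime := "."]

print(res)
```

The output keeps one row per row of left, in the order of left, so the row count matches the original table as long as the keys in the de-duplicated table are unique.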