Merge two datasets based on information in two columns

Time: 2014-04-05 20:43:12

Tags: r

I have two large datasets like these:

df1 <- data.frame(subject = c(rep(1, 15), rep(2, 14)), day =c(0,0,1,1,1,2,3,15,15,16,16,17,17,18,19,0,0,1,1,2,3,15,15,16,16,17,17,18,19),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/17/2012 7:22','4/17/2012 7:45','4/17/2012 8:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/2/2012 6:28','5/2/2012 7:00','5/3/2012 6:22','5/3/2012 7:00','5/4/2012 6:26','5/5/2012 6:47','4/23/2012 5:56','4/23/2012 6:30','4/24/2012 6:55','4/24/2012 7:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/9/2012 5:55','5/9/2012 6:30','5/10/2012 5:55','5/10/2012 6:30','5/11/2012 6:41','5/12/2012 6:46'))

df2 <- data.frame(subject = c(rep(1, 10), rep(2, 10)), day =c(1,1,2,3,9,12,15,15,16,17,1,1,2,3,9,13,15,15,16,17),dtime=c('4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/25/2012 7:15','4/28/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','5/2/2012 7:00','5/6/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45'))

...

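I want to merge the two datasets so that 'dtime' from df2 is matched to df1 by 'subject' and 'day', with missing values filled with '.'. The result should have the same number of rows as df1. The expected output should look like this: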

merged <- data.frame(subject = c(rep(1, 15), rep(2, 14)), day =c(0,0,1,1,1,2,3,15,15,16,16,17,17,18,19,0,0,1,1,2,3,15,15,16,16,17,17,18,19),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/17/2012 7:22','4/17/2012 7:45','4/17/2012 8:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/2/2012 6:28','5/2/2012 7:00','5/3/2012 6:22','5/3/2012 7:00','5/4/2012 6:26','5/5/2012 6:47','4/23/2012 5:56','4/23/2012 6:30','4/24/2012 6:55','4/24/2012 7:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/9/2012 5:55','5/9/2012 6:30','5/10/2012 5:55','5/10/2012 6:30','5/11/2012 6:41','5/12/2012 6:46'),dtime =c('.','.','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','.','.','.','.','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','.','.'))

...

I tried merge(df1, df2, by = c('subject', 'day')), but it did not work well: it produced extra rows that I don't want.

Does anyone know how to achieve this?

1 Answer:

Answer 0: (score: 2)

This seems to work.

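# Left join on subject and day; unique() drops the duplicated rows of df2
result <- merge(df1, unique(df2), by = c("subject", "day"), all.x = TRUE)
# dtime may be a factor, so convert it before assigning a new value
result$dtime <- as.character(result$dtime)
# Replace the NAs left by unmatched rows with "."
result$dtime[is.na(result$dtime)] <- "."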

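Some notes:

  1. You don't need the by=... argument in merge(...), because the default is to merge on all common columns (in your case, subject and day). I included it for clarity.
  2. The other answer produced extra rows because some of the rows in df2 are duplicated. In this case we can deal with that using unique(...), but in general this is a symptom of a bigger problem. You should really look into why there are duplicated rows...
  3. The way you set it up, dtime is a factor (data.frame() converted strings to factors by default before R 4.0.0). You have to convert it to character before setting the NAs to something else.
  4. Finally, if your datasets are indeed large (millions of rows), consider using the data.table package. It will be much faster.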
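
Notes 1 to 3 can be checked on a small made-up example. The sketch below uses hypothetical toy data frames `a` and `b` (not the data from the question) and sets `stringsAsFactors = TRUE` explicitly so that `dtime` is a factor; it is an illustration under those assumptions, not a definitive recipe.

```r
# Toy data (hypothetical, not the question's data); b contains one duplicated row.
a <- data.frame(subject = c(1, 1, 1, 2), day = c(0, 1, 1, 1),
                stime = c("s1", "s2", "s3", "s4"), stringsAsFactors = TRUE)
b <- data.frame(subject = c(1, 1, 2), day = c(1, 1, 1),
                dtime = c("d1", "d1", "d2"), stringsAsFactors = TRUE)

# Note 2: the duplicated row in b multiplies the matching rows of a.
nrow(merge(a, b, all.x = TRUE))   # more rows than nrow(a)
b[duplicated(b), ]                # lists the rows that are exact repeats

# Note 1: with `by` omitted, merge() joins on the shared columns subject and day.
res <- merge(a, unique(b), all.x = TRUE)
nrow(res) == nrow(a)              # one output row per row of a

# Note 3: dtime is a factor here, so convert it before assigning ".".
res$dtime <- as.character(res$dtime)
res$dtime[is.na(res$dtime)] <- "."
res
```

Running `duplicated()` on the second table before merging is a cheap way to find out whether the repeats are expected or point to a problem upstream.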

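
For the data.table suggestion in note 4, one possible shape is sketched below. It assumes the data.table package is installed and uses hypothetical toy tables `dt1` and `dt2` in place of the real data; the join syntax shown is one option among several and is not necessarily what the answer originally used.

```r
library(data.table)

# Toy data (hypothetical); dt2 contains one duplicated row.
dt1 <- data.table(subject = c(1, 1, 1, 2), day = c(0, 1, 1, 1),
                  stime = c("s1", "s2", "s3", "s4"))
dt2 <- data.table(subject = c(1, 1, 2), day = c(1, 1, 1),
                  dtime = c("d1", "d1", "d2"))

# X[i, on = ...] looks up each row of i (dt1) in X (the de-duplicated dt2).
# With unique join keys in X, the result has one row per row of dt1,
# in dt1's order, and unmatched rows get NA in dtime.
res <- unique(dt2)[dt1, on = c("subject", "day")]

# Replace the NAs by reference.
res[is.na(dtime), dtime := "."]
res
```

Because the join returns rows in the order of `dt1`, this form also keeps the original row order of the left table.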