I have two data.tables:
> DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c")
, col2 = c("b","d","c","a","d","a","c","a")
, col3 = c(1,2,3,4,5,6,7,8))
> DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a")
, col2 = c("d","b","c","d","a","a","c","a")
, col3 = c(NA,1,2,NA,6,NA,3,NA))
> DT1
col1 col2 col3
1: a b 1
2: b d 2
3: b c 3
4: a a 4
5: c d 5
6: b a 6
7: a c 7
8: c a 8
> DT2
col1 col2 col3
1: b d NA
2: e b 1
3: c c 2
4: e d NA
5: b a 6
6: c a NA
7: d c 3
8: a a NA
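I want to match the rows of DT2 where col3 is NA against the rows of DT1 using col1 and col2 and, where a match exists, fill the NA in DT2's col3 with the value from the matching DT1 row.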
> #desired Output
> DT2_output
col1 col2 col3
1: b d 2
2: e b 1
3: c c 2
4: e d NA
5: b a 6
6: c a 8
7: d c 3
8: a a 4
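How can I do this with a concise data.table operation (no loops), given that each data.table has millions of rows? I tried the following, but it gives me an error, which I think has something to do with the which statement.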
> ##doesn't work
> DT2[is.na(col3), col3 := DT1[which(col1 == DT2[is.na(col3),col1] && col2 == DT2[is.na(col3), col2]), col3]]
Answer 0: (score: 2)
I would do a straightforward left join and then use an ifelse condition based on whether the original col3 is missing, like this:
DT2new <- merge(DT2, DT1, by = c("col1", "col2"), all.x = TRUE)
DT2new[, col3 := ifelse(is.na(col3.x), col3.y, col3.x)]
DT2new <- DT2new[, .(col1, col2, col3)]
# col1 col2 col3
#1: a a 4
#2: b a 6
#3: b d 2
#4: c a 8
#5: c c 2
#6: d c 3
#7: e b 1
#8: e d NA
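Note that the merged result above is sorted by col1 and col2 rather than kept in DT2's original row order. A minimal sketch of one way to preserve that order, assuming a helper column (here called rowid, a name chosen only for illustration) and using data.table's fcoalesce() in place of ifelse():

```r
library(data.table)

DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c"),
                  col2 = c("b","d","c","a","d","a","c","a"),
                  col3 = c(1,2,3,4,5,6,7,8))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
                  col2 = c("d","b","c","d","a","a","c","a"),
                  col3 = c(NA,1,2,NA,6,NA,3,NA))

# Work on a copy and record DT2's original row order (rowid is hypothetical)
DT2tmp <- copy(DT2)[, rowid := .I]

# Left join: keep every row of DT2, attach DT1's col3 as col3.y
merged <- merge(DT2tmp, DT1, by = c("col1", "col2"), all.x = TRUE)

# fcoalesce() returns the first non-NA value, element by element
merged[, col3 := fcoalesce(col3.x, col3.y)]

# Restore the original order and drop the helper columns
setorder(merged, rowid)
DT2new <- merged[, .(col1, col2, col3)]
print(DT2new)
```

Both col3 columns are of type double here, which fcoalesce() needs, since it requires all its arguments to have the same type.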
Alternatively, a more efficient approach is an update by reference, which modifies the DT2 data.table directly (in place):
DT2[DT1, on = .(col1, col2), col3 := i.col3]
# col1 col2 col3
#1: b d 2
#2: e b 1
#3: c c 2
#4: e d NA
#5: b a 6
#6: c a 8
#7: d c 3
#8: a a 4
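Because the update join above changes DT2 itself, the original table is lost unless it is copied first. A minimal sketch of keeping both versions, where DT2_filled is a name introduced here purely for illustration:

```r
library(data.table)

DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c"),
                  col2 = c("b","d","c","a","d","a","c","a"),
                  col3 = c(1,2,3,4,5,6,7,8))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
                  col2 = c("d","b","c","d","a","a","c","a"),
                  col3 = c(NA,1,2,NA,6,NA,3,NA))

# copy() makes a deep copy, so the update below leaves DT2 untouched
DT2_filled <- copy(DT2)
DT2_filled[DT1, on = .(col1, col2), col3 := i.col3]

# DT2 still has its NAs; DT2_filled has them replaced where a match exists
print(DT2)
print(DT2_filled)
```

A plain assignment such as DT2_filled <- DT2 would not be enough, because both names would then refer to the same table and := would change both.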
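The i. in i.col3 refers to the i in DT[i, j, by], and therefore to column col3 of DT1. This works here because no (col1, col2) pair has a non-missing col3 in DT2 that conflicts with the value in DT1. If such conflicts can occur, you can use the following more general operation (example data.tables DT1 and DT2 included):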
DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c", "e"),
col2 = c("b","d","c","a","d","a","c","a", "b"),
col3 = c(1,2,3,4,5,6,7,8, 22))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
col2 = c("d","b","c","d","a","a","c","a"),
col3 = c(NA,1,2,NA,6,NA,3,NA))
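To check whether the two tables actually disagree anywhere, the matching pairs can be listed with both col3 values side by side. This is only a sketch; the column names dt2_col3 and dt1_col3 and the objects both and conflicts are made up for illustration:

```r
library(data.table)

DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c","e"),
                  col2 = c("b","d","c","a","d","a","c","a","b"),
                  col3 = c(1,2,3,4,5,6,7,8,22))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
                  col2 = c("d","b","c","d","a","a","c","a"),
                  col3 = c(NA,1,2,NA,6,NA,3,NA))

# Inner join (nomatch = NULL drops unmatched rows of DT1):
# x.col3 is the value from DT2, i.col3 is the value from DT1
both <- DT2[DT1,
            .(col1, col2, dt2_col3 = x.col3, dt1_col3 = i.col3),
            on = .(col1, col2), nomatch = NULL]

# Keep only pairs where both values are present and differ
conflicts <- both[!is.na(dt2_col3) & !is.na(dt1_col3) & dt2_col3 != dt1_col3]
print(conflicts)
```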
You can see that DT1 has a value of 22 for col1 = "e" and col2 = "b", while DT2 has a value of 1 for the same pair. To give DT2 precedence when such a conflict occurs, you can do:
DT2[DT1, on = .(col1, col2), col3 := ifelse(is.na(x.col3), i.col3, x.col3)]
which gives you
# col1 col2 col3
#1: b d 2
#2: e b 1
#3: c c 2
#4: e d NA
#5: b a 6
#6: c a 8
#7: d c 3
#8: a a 4
Here, the x. in x.col3 refers to DT2, the table on which the join is performed.
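Edit: a vectorized data.table operation is a more efficient general approach (for the case where DT1 itself contains a value for a col1 and col2 pair)
In light of the helpful comments (@Frank and @chinsoon12), I checked the provided solutions again. As mentioned, ifelse can be slow (for the reasons given in the comments), which is why a vectorized solution is the better approach.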
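A minimal sketch of what such a vectorized update could look like. It is one possible form and not necessarily the exact code the answer had in mind. It assumes each (col1, col2) pair appears at most once in DT1, so that the lookup returns exactly one value per row:

```r
library(data.table)

DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c","e"),
                  col2 = c("b","d","c","a","d","a","c","a","b"),
                  col3 = c(1,2,3,4,5,6,7,8,22))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
                  col2 = c("d","b","c","d","a","a","c","a"),
                  col3 = c(NA,1,2,NA,6,NA,3,NA))

# Only the rows of DT2 with a missing col3 are selected in i.
# .SD is that subset; it is looked up in DT1 on (col1, col2), and
# x.col3 is DT1's col3 (NA where DT1 has no matching row).
DT2[is.na(col3), col3 := DT1[.SD, on = .(col1, col2), x.col3]]
print(DT2)
```

Since rows of DT2 that already hold a value are never selected, the existing 1 for col1 = "e" and col2 = "b" is kept without any ifelse call.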