Match rows of two data.tables to fill a subset of a data.table

Time: 2018-07-13 08:15:08

Tags: r data.table match

I have two data tables:

> DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c")
                  , col2 = c("b","d","c","a","d","a","c","a")
                  , col3 = c(1,2,3,4,5,6,7,8))
> DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a")
                  , col2 = c("d","b","c","d","a","a","c","a")
                  , col3 = c(NA,1,2,NA,6,NA,3,NA))

> DT1
   col1 col2 col3
1:    a    b    1
2:    b    d    2
3:    b    c    3
4:    a    a    4
5:    c    d    5
6:    b    a    6
7:    a    c    7
8:    c    a    8

> DT2
   col1 col2 col3
1:    b    d   NA
2:    e    b    1
3:    c    c    2
4:    e    d   NA
5:    b    a    6
6:    c    a   NA
7:    d    c    3
8:    a    a   NA

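I want to use col1 and col2 to match the rows of DT2 where col3 is NA against the rows of DT1 and, where a match exists, fill the NA value of col3 in DT2 with the col3 value from the matching DT1 row.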

> #desired Output
> DT2_output
   col1 col2 col3
1:    b    d    2
2:    e    b    1
3:    c    c    2
4:    e    d   NA
5:    b    a    6
6:    c    a    8
7:    d    c    3
8:    a    a    4

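How can I do this with a concise data.table operation (no loops), given that each data.table has millions of rows? I tried the following, but it gives me an error, which I think has to do with the which statement.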

> ##doesn't work
> DT2[is.na(col3), col3 := DT1[which(col1 == DT2[is.na(col3),col1] && col2 == DT2[is.na(col3), col2]), col3]]

1 answer:

Answer 0: (score: 2)

I would do a straight left join and then use an ifelse condition based on whether the original col3 is missing, as follows:

DT2new <- merge(DT2, DT1, by = c("col1", "col2"), all.x = TRUE)
DT2new[, col3 := ifelse(is.na(col3.x), col3.y, col3.x)]
DT2new <- DT2new[, .(col1, col2, col3)]

#   col1 col2 col3
#1:    a    a    4
#2:    b    a    6
#3:    b    d    2
#4:    c    a    8
#5:    c    c    2
#6:    d    c    3
#7:    e    b    1
#8:    e    d   NA
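
Note that merge() sorts the result by the join columns by default, which is why the rows above are no longer in DT2's original order. The following is a minimal sketch of a variant that keeps DT2's order, assuming a data.table version in which merge() retains the row order of x when sort = FALSE and which provides fcoalesce() (1.12.4 or later). The name DT2_filled is only illustrative and is not part of the original answer.

```r
library(data.table)

DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c"),
                  col2 = c("b","d","c","a","d","a","c","a"),
                  col3 = c(1,2,3,4,5,6,7,8))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
                  col2 = c("d","b","c","d","a","a","c","a"),
                  col3 = c(NA,1,2,NA,6,NA,3,NA))

# Left join; sort = FALSE keeps the rows in DT2's order
# (DT2_filled is a hypothetical name chosen for this sketch)
DT2_filled <- merge(DT2, DT1, by = c("col1", "col2"), all.x = TRUE, sort = FALSE)

# fcoalesce() takes the first non-missing value, element by element,
# so DT2's own col3 wins and DT1's col3 only fills the gaps
DT2_filled[, col3 := fcoalesce(col3.x, col3.y)]

# Drop the two helper columns created by the merge
DT2_filled[, c("col3.x", "col3.y") := NULL]
```

Unlike the update join, this still builds a new table rather than modifying DT2 in place, so it is likely to use more memory on tables with millions of rows.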

Or, a more efficient approach is to update by reference, which modifies the DT2 data.table directly (in place):

DT2[DT1, on = .(col1, col2), col3 := i.col3]

#   col1 col2 col3
#1:    b    d    2
#2:    e    b    1
#3:    c    c    2
#4:    e    d   NA
#5:    b    a    6
#6:    c    a    8
#7:    d    c    3
#8:    a    a    4

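The i. prefix in i.col3 refers to the i in DT[i, j, by], and thus to column col3 of DT1. This works here because no pair of col1 and col2 that appears in both data.tables has a non-missing col3 in DT2 that differs from the value in DT1, so the join never overwrites a value you want to keep. If you do run into that case, you can use the following more general operation (shown with an example of data.tables DT1 and DT2):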

DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c", "e"), 
                  col2 = c("b","d","c","a","d","a","c","a", "b"),
                  col3 = c(1,2,3,4,5,6,7,8, 22))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
                  col2 = c("d","b","c","d","a","a","c","a"),
                  col3 = c(NA,1,2,NA,6,NA,3,NA))

You can see that DT1 has a value of 22 for col1 = "e" and col2 = "b", while DT2 has a value of 1 for col1 = "e" and col2 = "b". To prefer DT2 when such a conflict occurs, you can do:

DT2[DT1, on = .(col1, col2), col3 := ifelse(is.na(x.col3), i.col3, x.col3)]

Which gives you

#   col1 col2 col3
#1:    b    d    2
#2:    e    b    1
#3:    c    c    2
#4:    e    d   NA
#5:    b    a    6
#6:    c    a    8
#7:    d    c    3
#8:    a    a    4

The x. prefix in x.col3 refers to col3 in DT2.
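
Whether such conflicts exist at all can be checked before choosing between the two update joins. The following is a rough sketch of one way to do that, run on the example data before any update is applied. The names both, conflicts, dt2_col3 and dt1_col3 are made up for illustration, and nomatch = NULL assumes a reasonably recent data.table version.

```r
library(data.table)

DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c", "e"),
                  col2 = c("b","d","c","a","d","a","c","a", "b"),
                  col3 = c(1,2,3,4,5,6,7,8, 22))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
                  col2 = c("d","b","c","d","a","a","c","a"),
                  col3 = c(NA,1,2,NA,6,NA,3,NA))

# Inner join on the key columns, keeping both versions of col3 side by side:
# x.col3 comes from DT2 (the x table), i.col3 from DT1 (the i table)
both <- DT2[DT1, on = .(col1, col2), nomatch = NULL,
            .(col1, col2, dt2_col3 = x.col3, dt1_col3 = i.col3)]

# Keep only pairs where DT2 already has a value and it differs from DT1's
conflicts <- both[!is.na(dt2_col3) & dt2_col3 != dt1_col3]
print(conflicts)
```

If conflicts is empty, the plain col3 := i.col3 update is enough; otherwise the update has to decide explicitly which table takes precedence.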

EDIT: A vectorized data.table operation is a more efficient general approach (for the case where DT1 contains its own value for a pair of col1 and col2)

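In light of the helpful comments (from @Frank and @chinsoon12), I took another look at the solutions provided. As already mentioned, ifelse can be slow (for the reasons given in the comments), which is why a vectorized solution is the better approach.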
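
One way such a vectorized update might look is sketched below. This is a commonly used data.table idiom and not necessarily the exact code the answer originally showed. It assumes that DT1 has at most one row per col1/col2 pair; with duplicates the lookup would return more values than there are rows to fill.

```r
library(data.table)

DT1 <- data.table(col1 = c("a","b","b","a","c","b","a","c", "e"),
                  col2 = c("b","d","c","a","d","a","c","a", "b"),
                  col3 = c(1,2,3,4,5,6,7,8, 22))
DT2 <- data.table(col1 = c("b","e","c","e","b","c","d","a"),
                  col2 = c("d","b","c","d","a","a","c","a"),
                  col3 = c(NA,1,2,NA,6,NA,3,NA))

# Only the rows of DT2 whose col3 is missing are touched; .SD is that subset.
# The inner join looks up each of its (col1, col2) pairs in DT1, where
# x.col3 is DT1's col3 (DT1 is the x table of the inner join).
# Pairs without a match in DT1 yield NA, so those rows stay missing.
DT2[is.na(col3), col3 := DT1[.SD, on = .(col1, col2), x.col3]]

print(DT2)
```

Because the update is restricted to the rows where col3 is NA, existing values in DT2 such as the 1 for col1 = "e" and col2 = "b" are never compared or overwritten, and no ifelse call is needed.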