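I have many tables that I need to join. However, in some cells the value is NA, and an NA needs to match every possible value.
In SQL, this would look something like: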
SELECT * FROM A
LEFT JOIN B
ON (A.KEY1 = B.KEY1 OR B.KEY1 IS NULL)
AND (A.KEY2 = B.KEY2 OR B.KEY2 IS NULL) -- repeated for every other column
I can work around this by performing many joins, for example:
B[A, on = .(Key1, Key2, Key3), Var := i.Var]
B[A[is.na(Key2), ], on = .(Key1, Key3), Var := i.Var]
B[A[is.na(Key3), ], on = .(Key1, Key2), Var := i.Var]
B[A[is.na(Key2) & is.na(Key3), ], on = .(Key1), Var := i.Var]
B[A[is.na(Key1), ], on = .(Key2, Key3), Var := i.Var]
B[A[is.na(Key1) & is.na(Key2), ], on = .(Key3), Var := i.Var]
B[A[is.na(Key1) & is.na(Key3), ], on = .(Key2), Var := i.Var]
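However, this does not seem like the best approach, especially as the number of columns grows. The above needs 7 update joins for just 3 columns.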
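To make the growth concrete: the 7 joins correspond to the 2^3 - 1 non-empty subsets of the key columns. A minimal sketch that enumerates those subsets with base R's `combn` (the vector `keys` and the list `subsets` are illustrative names, not part of the original code):

```r
# Illustrative key names, matching the update joins above
keys <- c("Key1", "Key2", "Key3")

# Every non-empty subset of the key columns: one join per subset
subsets <- unlist(
  lapply(seq_along(keys), function(m) combn(keys, m, simplify = FALSE)),
  recursive = FALSE
)

# Number of joins needed for k key columns is 2^k - 1
length(subsets)
```

With 10 key columns the same enumeration would already give 1023 joins, which is why the manual approach does not scale.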
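For example, suppose I have a table that maps a name to a description of a person (the city they live in, their hair colour, their height).
Observed data: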
a <- data.table(id = c(1, 2, 3),
city = c("city1", "city2", "city2"),
height = c("tall", "tall", "short"),
hair = c("black", "black", "blonde"))
id city height hair
1: 1 city1 tall black
2: 2 city2 tall black
3: 3 city2 short blonde
Table to match against:
b <- data.table(city = c("city1", "city1", "city2", "city2"),
height = c("tall", "tall", "short", "tall"),
hair = c("black", "blonde", "blonde", "black"),
name = c("dave", "harry", "jack", "william"))
city height hair name
1: city1 tall black dave
2: city1 tall blonde harry
3: city2 short blonde jack
4: city2 tall black william
Joining them:
b[a, on = .(city, height, hair), .(id, city, height, hair, name)]
id city height hair name
1: 1 city1 tall black dave
2: 2 city2 tall black william
3: 3 city2 short blonde jack
This is as expected. However, I need the join to also work when some fields in b are missing, for example:
city height hair name
1: city1 NA black dave
2: city1 NA blonde harry
3: city2 short NA jack
4: city2 tall black william
It should still produce the same output.
Is there an efficient way to do this within the data.table framework?
Thanks
Edit:
To make this a bit clearer, if table b is
b <- data.table(city = c("city1", "city1", "city2", "city2"),
height = c(NA, "tall", "short", "tall"),
hair = c("black", "blonde", "blonde", "black"),
name = c("dave", "harry", "jack", "william"))
then the join only produces:
id city height hair name
1: 1 city1 tall black NA
2: 2 city2 tall black william
3: 3 city2 short blonde jack
when it should produce:
id city height hair name
1: 1 city1 tall black dave
2: 2 city2 tall black william
3: 3 city2 short blonde jack
NA should be treated as a "wildcard" that matches any value.
Edit 2:
A second workaround I found is to first build the Cartesian product of the two tables:
ab <- a[, as.list(b), by = .(id, i.city = city, i.height = height, i.hair = hair)]
id i.city i.height i.hair city height hair name
1: 1 city1 tall black city1 NA black dave
2: 1 city1 tall black city1 tall blonde harry
3: 1 city1 tall black city2 short blonde jack
4: 1 city1 tall black city2 tall black william
5: 2 city2 tall black city1 NA black dave
6: 2 city2 tall black city1 tall blonde harry
7: 2 city2 tall black city2 short blonde jack
8: 2 city2 tall black city2 tall black william
9: 3 city2 short blonde city1 NA black dave
10: 3 city2 short blonde city1 tall blonde harry
11: 3 city2 short blonde city2 short blonde jack
12: 3 city2 short blonde city2 tall black william
and then apply my conditions as a filter:
ab[(i.city == city | is.na(city))
& (i.height == height | is.na(height))
& (i.hair == hair | is.na(hair))]
id i.city i.height i.hair city height hair name
1: 1 city1 tall black city1 NA black dave
2: 2 city2 tall black city2 tall black william
3: 3 city2 short blonde city2 short blonde jack
With large datasets, though, I am not sure a Cartesian join like this is the best approach.
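One possible way to avoid the full Cartesian product, sketched here under assumptions rather than offered as an established data.table idiom: group the rows of b by which key columns are NA, and for each group run one ordinary join on the remaining columns. This needs one join per NA pattern that actually occurs in b, not one per possible subset. The names `wildcard_join` and `wild_dummy` are hypothetical, and the sketch assumes that a has no NA in its key columns and that a and b share no non-key column names.

```r
library(data.table)

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))
b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c(NA, "tall", "short", "tall"),
                hair = c("black", "blonde", "blonde", "black"),
                name = c("dave", "harry", "jack", "william"))
keys <- c("city", "height", "hair")

# Hypothetical helper: one inner join per distinct NA pattern found in b
wildcard_join <- function(a, b, keys) {
  value_cols <- setdiff(names(b), keys)
  # Constant column so that an all-NA row of b still has something to join on
  a2 <- copy(a)[, wild_dummy := 1L]
  # NA pattern per row of b, e.g. "010" means only the second key is NA
  pattern <- do.call(paste0, lapply(keys, function(k) as.integer(is.na(b[[k]]))))
  pieces <- lapply(split(seq_len(nrow(b)), pattern), function(rows) {
    sub <- b[rows]
    # Join only on the keys that are not wildcards in this group
    on_cols <- keys[!vapply(keys, function(k) is.na(sub[[k]][1L]), logical(1))]
    sub <- sub[, c(on_cols, value_cols), with = FALSE][, wild_dummy := 1L]
    merge(a2, sub, by = c(on_cols, "wild_dummy"), allow.cartesian = TRUE)
  })
  out <- rbindlist(pieces, use.names = TRUE)
  out[, wild_dummy := NULL]
  setcolorder(out, names(a))
  out
}

res <- wildcard_join(a, b, keys)[order(id)]
res
```

For the table b from the first edit this runs two joins (one for rows without NA, one for rows where only height is NA). Note that it is an inner join, so rows of a with no match in b are dropped; whether that is acceptable depends on the use case.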
Answer 0 (score: 1)
The least inefficient way I could think of is to simply expand b so that an ordinary join can be done afterwards.
library(data.table)
a <- data.table(id = c(1, 2, 3),
city = c("city1", "city2", "city2"),
height = c("tall", "tall", "short"),
hair = c("black", "black", "blonde"))
a_unique <- a[, lapply(.SD, function(x) { list(unique(x)) })]
b <- data.table(city = c("city1", "city1", "city2", "city2"),
height = c(NA, "tall", "short", NA),
hair = c("black", NA, "blonde", NA),
name = c("dave", "harry", "jack", "william"))
# Flatten the two-column matrix row by row and keep the single non-NA entry of each row
harmonize <- function(mat) {
ans <- as.vector(t(mat))
ans[!is.na(ans)]
}
expand_recursively <- function(dt, cols) {
if (length(cols) == 0L) return(dt)
current <- cols[1L]
next_cols <- cols[-1L]
not_current <- setdiff(names(dt), current)
na_class <- class(a_unique[[current]][[1L]])
expanded <- data.table(as(NA, na_class), all = a_unique[[current]][[1L]])
setnames(expanded, c(current, "all"))
next_dt <- expanded[dt,
c(list(harmonize(as.matrix(.SD))), mget(not_current)),
on = current,
.SDcols = c(current, "all"),
allow.cartesian = TRUE]
setnames(next_dt, "V1", current)
expand_recursively(next_dt, next_cols)
}
b_expanded <- expand_recursively(b, intersect(names(a), names(b)))
setcolorder(b_expanded, names(b))
b
city height hair name
1: city1 <NA> black dave
2: city1 tall <NA> harry
3: city2 short blonde jack
4: city2 <NA> <NA> william
b_expanded
city height hair name
1: city1 tall black dave
2: city1 short black dave
3: city1 tall black harry
4: city1 tall blonde harry
5: city2 short blonde jack
6: city2 tall black william
7: city2 tall blonde william
8: city2 short black william
9: city2 short blonde william
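I think the potentially problematic part is computing a_unique. If you already know the values that can be used for matching, you could perhaps specify them directly in expand_recursively.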
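As a rough illustration of that last suggestion, here is a sketch in which the candidate values are passed in explicitly and not derived from a. The function `expand_wildcards` and its `values` argument are hypothetical names, and the code is a simplified, non-recursive variant of the idea, not the function above.

```r
library(data.table)

a <- data.table(id = c(1, 2, 3),
                city = c("city1", "city2", "city2"),
                height = c("tall", "tall", "short"),
                hair = c("black", "black", "blonde"))
b <- data.table(city = c("city1", "city1", "city2", "city2"),
                height = c(NA, "tall", "short", NA),
                hair = c("black", NA, "blonde", NA),
                name = c("dave", "harry", "jack", "william"))

# Hypothetical helper: `values` is a named list of candidate values per key column
expand_wildcards <- function(b, values) {
  for (col in names(values)) {
    is_wild <- is.na(b[[col]])
    n_wild <- sum(is_wild)
    if (n_wild > 0L) {
      cand <- values[[col]]
      # Repeat each wildcard row once per candidate value, then fill in the values
      wild <- b[rep(which(is_wild), each = length(cand))]
      set(wild, j = col, value = rep(cand, times = n_wild))
      b <- rbind(b[!is_wild], wild)
    }
  }
  b
}

known_values <- list(city = c("city1", "city2"),
                     height = c("tall", "short"),
                     hair = c("black", "blonde"))
b_expanded <- expand_wildcards(b, known_values)

# An ordinary equi-join now works, since b_expanded has no NA left in its keys
res <- b_expanded[a, on = .(city, height, hair), .(id, name)]
```

With this b, `b_expanded` has 9 rows, and a single row of a can match several names (for example, id 1 matches both dave and harry), so the join may return more rows than a contains.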