I have a question about the data.table "not join" idiom, inspired by Iterator's question. Here is an example:
library(data.table)
dt1 <- data.table(A1=letters[1:10], B1=sample(1:5,10, replace=TRUE))
dt2 <- data.table(A2=letters[c(1:5, 11:15)], B2=sample(1:5,10, replace=TRUE))
setkey(dt1, A1)
setkey(dt2, A2)
The two data.tables look like this:
> dt1              > dt2
      A1 B1              A2 B2
 [1,]  a  1         [1,]  a  2
 [2,]  b  4         [2,]  b  5
 [3,]  c  2         [3,]  c  2
 [4,]  d  5         [4,]  d  1
 [5,]  e  1         [5,]  e  1
 [6,]  f  2         [6,]  k  5
 [7,]  g  3         [7,]  l  2
 [8,]  h  3         [8,]  m  4
 [9,]  i  2         [9,]  n  1
[10,]  j  4        [10,]  o  1
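To find which rows of dt2 have a matching key in dt1, set the which option to TRUE. The result has one entry per row of dt2: the matching row number in dt1, or NA where there is no match: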
> dt1[dt2, which=TRUE]
[1] 1 2 3 4 5 NA NA NA NA NA
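For intuition, the same positions can be reproduced in base R with match(). This is only a sketch and assumes the key values are unique, as they are in this example; it is not how data.table computes the join internally.

```r
# Key columns from the example above, as plain character vectors
a1 <- letters[1:10]             # keys of dt1
a2 <- letters[c(1:5, 11:15)]    # keys of dt2

# For each key in a2, the position of the first match in a1, NA if absent
pos <- match(a2, a1)
pos
```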
In this answer, Matthew suggested a "not join" idiom:
dt1[-dt1[dt2, which=TRUE]]
which should subset dt1 to those rows whose keys do not appear in dt2. On my machine, with data.table v1.7.1, I get an error:
Error in `[.default`(x[[s]], irows): only 0's may be mixed with negative subscripts
With the option nomatch=0, however, the "not join" works:
> dt1[-dt1[dt2, which=TRUE, nomatch=0]]
A1 B1
[1,] f 2
[2,] g 3
[3,] h 3
[4,] i 2
[5,] j 4
Is this the intended behaviour?
Answer 0 (score: 17)
New in v1.8.3:
A new "!" prefix on i signals 'not-join' (a.k.a. 'not-where'), #1384.
DT[-DT["a", which=TRUE, nomatch=0]] # old not-join idiom, still works
DT[!"a"] # same result, now preferred.
DT[!J(6),...] # !J == not-join
DT[!2:3,...] # ! on all types of i
DT[colA!=6L | colB!=23L,...] # multiple vector scanning approach
DT[!J(6L,23L)] # same result, faster binary search
'!' has been used rather than '-' :
* to match the 'not-join' and 'not-where' nomenclature
* with '-', DT[-0] would return DT rather than DT[0] and not be backwards
compatible. With '!', DT[!0] returns DT both before (since !0 is TRUE in
base R) and after this new feature.
* to leave DT[+...] and DT[-...] available for future use
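Applied to the tables from the question, the two idioms can be compared side by side. This is a minimal sketch that assumes data.table v1.8.3 or later is installed; the B1 and B2 values are fixed here (instead of sample()) purely so the result is reproducible, and the names old_idiom and not_join are illustrative.

```r
library(data.table)

# Same keys as in the question; value columns fixed for reproducibility
dt1 <- data.table(A1 = letters[1:10], B1 = 1:10)
dt2 <- data.table(A2 = letters[c(1:5, 11:15)], B2 = 10:1)
setkey(dt1, A1)
setkey(dt2, A2)

# Older idiom: drop the row numbers of dt1 that match dt2
old_idiom <- dt1[-dt1[dt2, which = TRUE, nomatch = 0]]

# "!" prefix: rows of dt1 whose key does not appear in dt2
not_join <- dt1[!dt2]

not_join$A1   # "f" "g" "h" "i" "j"
```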
Answer 1 (score: 5)
As far as I know, this comes from base R.
# This works
(1:4)[c(-2,-3)]
# But this gives you the same error you described above
(1:4)[c(-2, -3, NA)]
# Error in (1:4)[c(-2, -3, NA)] :
# only 0's may be mixed with negative subscripts
The wording of the error message suggests that this is intended behaviour.
My best guess as to why it is intended:
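Judging by how they handle NA elsewhere (for example, functions typically default to na.rm=FALSE), R's designers seem to view NA as carrying important information, and they are reluctant to drop it without an explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)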
In this context, that preference probably explains why NA is accepted in positive indexing but not in negative indexing:
# Positive indexing: works, because the return value retains info about NA's
(1:4)[c(2,3,NA)]
# Negative indexing: doesn't work, because it can't easily retain such info
(1:4)[c(-2,-3,NA)]
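If the NAs carry no information you need, one way to give base R that explicit instruction is to drop them before negating. A minimal sketch, where idx stands in for a vector of match positions (the names idx, dropped and keep are illustrative):

```r
x <- 1:4
idx <- c(2, 3, NA)            # positions to exclude; NA means "no match"

# Option 1: remove the NAs, then use negative indexing
dropped <- idx[!is.na(idx)]
res_neg <- x[-dropped]

# Option 2: compute the positions to keep; the NA is ignored because it is
# not among the valid positions
keep <- setdiff(seq_along(x), idx)
res_keep <- x[keep]

res_keep   # 1 4
```

Option 2 also behaves sensibly when nothing matches: x[-integer(0)] returns an empty vector in base R, whereas setdiff() then keeps every position.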
Answer 2 (score: 2)
New in data.table version 1.7.3:
New option datatable.nomatch allows the default for nomatch to be changed from NA to 0, ...