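I have two data.tables that I want to join on a numeric (double) variable. However, the numeric values carry some uncertainty, so I have to allow a tolerance, and that tolerance varies with the value of the variable.
In the example below, "mz" is the variable on which I want to join DT1 and DT2. The tolerance is computed from the variable iso_mz: iso_mz * 5e-6.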
library(data.table)
DT1 <- data.table(mz = c(433.231512451172, 451.091953822545, 454.347605202415, 490.167234693255, 518.225894504123),
                  Var1 = c(433.231018066406, 451.091430664062, 454.347015380859, 490.166381835938, 518.22509765625),
                  Var2 = c(433.232147216797, 451.092559814453, 454.34814453125, 490.168273925781, 518.2265625))
DT2 <- data.table(iso_mz = c(451.0900, 490.1651, 518.2281, 433.2335),
                  comp = c("m1", "m2", "m3", "m4"))
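To make the tolerance concrete, here is a minimal sketch that computes the matching window for each iso_mz value. The table name `windows` and the column names `tol`, `lower`, and `upper` are made up for illustration and are not part of the original data.

```r
library(data.table)

DT2 <- data.table(iso_mz = c(451.0900, 490.1651, 518.2281, 433.2335),
                  comp   = c("m1", "m2", "m3", "m4"))

# Hypothetical helper table: tolerance and matching window per iso_mz value.
# The tolerance scales with iso_mz, so each row gets its own window width.
windows <- DT2[, .(comp, iso_mz,
                   tol   = iso_mz * 5e-6,
                   lower = iso_mz - iso_mz * 5e-6,
                   upper = iso_mz + iso_mz * 5e-6)]
print(windows)
```

For these values the tolerance is roughly 0.002 to 0.003, so an mz value should match when it lies within that distance of iso_mz.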
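If I did not need a tolerance, I could use the "on = .()" feature of the data.table package. I tried to adapt the code from Joining Data Frames by Measured Values with an Error Range, but for some reason I cannot get it to run.
The desired output for my example is: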
Output <- data.table(
iso_mz = c(433.2335, 451.0900, 490.1651, 518.2281),
comp = c("m4", "m1", "m2", "m3"),
mz = c(433.231512451172, 451.091953822545, 490.167234693255, 518.225894504123),
Var1 = c(433.231018066406, 451.091430664062, 490.166381835938, 518.22509765625),
Var2 = c(433.232147216797, 451.092559814453, 490.168273925781, 518.2265625))
Thanks in advance!
Answer 0: (score: 1)
Here is an approach using foverlaps() from data.table.
tolerance <- 5e-6
# create ranges to join on
DT1[, `:=`(min = mz - mz * tolerance,
           max = mz + mz * tolerance)]
DT2[, `:=`(min = iso_mz - iso_mz * tolerance,
           max = iso_mz + iso_mz * tolerance)]
# set keys
setkey(DT1, min, max)
setkey(DT2, min, max)
# perform overlap join, order, remove min-max columns
ans <- setorder(foverlaps(DT2, DT1), mz)[, `:=`(min = NULL, max = NULL, i.min = NULL, i.max = NULL)][]
# mz Var1 Var2 iso_mz comp
# 1: 433.2315 433.2310 433.2321 433.2335 m4
# 2: 451.0920 451.0914 451.0926 451.0900 m1
# 3: 490.1672 490.1664 490.1683 490.1651 m2
# 4: 518.2259 518.2251 518.2266 518.2281 m3
# check
all.equal( setcolorder(ans, names(Output)), Output )
# [1] TRUE
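Note that the foverlaps() approach widens the values in both tables, so two values can be matched when they differ by up to roughly twice the tolerance. If the window should depend on iso_mz alone, as the question describes, a non-equi join is one possible alternative. The following is a sketch under that assumption, not the method used above; the column names `lower` and `upper` and the result name `res` are made up for illustration.

```r
library(data.table)

DT1 <- data.table(mz = c(433.231512451172, 451.091953822545, 454.347605202415, 490.167234693255, 518.225894504123),
                  Var1 = c(433.231018066406, 451.091430664062, 454.347015380859, 490.166381835938, 518.22509765625),
                  Var2 = c(433.232147216797, 451.092559814453, 454.34814453125, 490.168273925781, 518.2265625))
DT2 <- data.table(iso_mz = c(451.0900, 490.1651, 518.2281, 433.2335),
                  comp = c("m1", "m2", "m3", "m4"))

tolerance <- 5e-6

# Hypothetical window columns, computed from iso_mz only
DT2[, `:=`(lower = iso_mz - iso_mz * tolerance,
           upper = iso_mz + iso_mz * tolerance)]

# Non-equi join: keep DT1 rows whose mz falls inside [lower, upper].
# In a non-equi join the join column takes its values from DT2's bounds,
# so x.mz is used to recover the original mz from DT1.
res <- DT1[DT2, on = .(mz >= lower, mz <= upper), nomatch = NULL,
           .(iso_mz, comp, mz = x.mz, Var1, Var2)]
setorder(res, iso_mz)
print(res)
```

With the example data every iso_mz value finds exactly one mz inside its window, and the unmatched DT1 row (mz = 454.3476) is dropped because of nomatch = NULL.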