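I am trying to remove rows from a data.frame where the value in the posn column does not fall within the ranges given in another data.frame, using data.table's non-equi join feature.
Here is my data: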
library(data.table)
df.cov <-
structure(list(posn = c(1, 2, 3, 165, 1000), att = c("a", "b",
"c", "d", "e")), .Names = c("posn", "att"), row.names = c(NA,
-5L), class = "data.frame")
df.exons <-
structure(list(start = c(2889, 2161, 277, 164, 1), end = c(3329,
2826, 662, 662, 168)), .Names = c("start", "end"), row.names = c(NA,
-5L), class = "data.frame")
setDT(df.cov)
setDT(df.exons)
df.cov
# posn att
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 165 d
# 5: 1000 e
df.exons # ranges of `posn` to include
# start end
# 1: 2889 3329
# 2: 2161 2826
# 3: 277 662
# 4: 164 662
# 5: 1 168
Here is my attempt:
df.cov[df.exons, on = .(posn >= start, posn <= end), nomatch = 0]
# posn att posn.1
# 1: 164 d 662
# 2: 1 a 168
# 3: 1 b 168
# 4: 1 c 168
# 5: 1 d 168
As you can see, the posn column of df.cov has been changed as well. The expected result is:
# posn att
# 1: 165 d
# 2: 1 a
# 3: 2 b
# 4: 3 c
# 5: 165 d
# The row order doesn't matter. I'll sort by posn later.
# It is also fine if the duplicated rows are removed; otherwise I'll do this in the next step.
How can I get the desired output using a data.table non-equi join?
Answer 0 (score: 7)
You can also use %inrange%:
df.cov[posn %inrange% df.exons]
which gives:
   posn att
1:    1   a
2:    2   b
3:    3   c
4:  165   d
As you can see, this leaves the values of the posn column unchanged.
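If explicit arguments read more clearly to you, the same filter can presumably be written with the function form `inrange()`, passing the lower and upper bounds separately. This is a minimal sketch: it rebuilds the question's data with `data.table()` rather than `structure()`, and the name `kept` is only illustrative.

```r
library(data.table)

# Same data as in the question, built directly as data.tables
df.cov <- data.table(posn = c(1, 2, 3, 165, 1000),
                     att  = c("a", "b", "c", "d", "e"))
df.exons <- data.table(start = c(2889, 2161, 277, 164, 1),
                       end   = c(3329, 2826, 662, 662, 168))

# Keep rows whose posn lies in at least one [start, end] interval.
# Bounds are inclusive by default (incbounds = TRUE).
kept <- df.cov[inrange(posn, df.exons$start, df.exons$end)]
kept
```

Because `inrange()` returns one logical value per row of `df.cov`, each row is kept at most once, so no de-duplication step is needed afterwards.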
Another possibility, albeit a longer one:
df.exons[df.cov
, on = .(start <= posn, end >= posn)
, mult ='first'
, nomatch = 0
, .(posn = i.posn, att)][]
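If you would rather keep the join direction from the question, one option that should work is to ask for the original column with the `x.` prefix in `j`. In a non-equi join the join columns in the result carry the values from `i` (here `start` and `end`), which is why `posn` appeared to change; `x.posn` requests the value from `df.cov` instead. The sketch below rebuilds the question's data; `joined` and `deduped` are illustrative names, and the `unique()` step is optional.

```r
library(data.table)

# Same data as in the question, built directly as data.tables
df.cov <- data.table(posn = c(1, 2, 3, 165, 1000),
                     att  = c("a", "b", "c", "d", "e"))
df.exons <- data.table(start = c(2889, 2161, 277, 164, 1),
                       end   = c(3329, 2826, 662, 662, 168))

# Non-equi join in the original direction; x.posn is the value from df.cov
joined <- df.cov[df.exons,
                 .(posn = x.posn, att),
                 on = .(posn >= start, posn <= end),
                 nomatch = 0]

# posn 165 falls in two ranges, so it appears twice; drop the duplicate
deduped <- unique(joined)
deduped
```

The row order follows `df.exons`, so sort by `posn` afterwards if needed, for example with `setorder(deduped, posn)`.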