Find the matching row in a data frame for values that fall within the range columnA:columnB

Date: 2019-02-20 13:37:58

Tags: r dataframe

I have a data.frame and a vector like this:

df = data.frame(id = 1:3,
                start = c(1, 1000, 16000),
                end = c(100, 1100, 16100),
                info = c("a", "b", "c"))

vec = cbind(id = 1:150, pos = c(sample(1:100, 50),
                                sample(1000:1100, 50),
                                sample(16000:16100, 50)))

For each value of vec, I want to find the corresponding row in df where:

  • vec$pos >= df$start
  • vec$pos <= df$end
  • vec$id == df$id

so that I can extract the info column.

The problem is that df has 1000 rows and vec has 2 million values, so iterating over vec with sapply is slow. Can anyone do this by iterating over df instead?

2 Answers:

Answer 0: (score: 3)

You can turn vec into intervals (with start equal to end) and use data.table::foverlaps:

library(data.table)

# Make df a data.table and set key
setDT(df)
setkey(df, start, end)

# Turn vector into a data.table with start and end
# (this assumes vec is a plain numeric vector of positions)
vec <- data.table(start = vec, end = vec)
setkey(vec, start, end)

# Apply overlaps for each vec entry
# This will get only those vec entries that overlap with df
foverlaps(vec, df, nomatch = NULL)

# Or if you want only info and vec column use:
foverlaps(vec, df, mult = "first", nomatch = NULL)[, .(info, vec = i.start)]

I tested it on dummy data (same dimensions as the OP's) and it takes a few seconds.

df <- data.table(start = sample(1:1e7, 1e3),
                 info  = sample(letters, 1e3, replace = TRUE))
df$end <- df$start + 10
setkey(df, start, end)

vec <- sample(2e6)
vec <- data.table(start = vec, end = vec)
setkey(vec, start, end)

microbenchmark::microbenchmark(
    foverlaps(vec, df, mult = "first", nomatch = NULL)
)

# Unit: seconds
#                                               expr      min       lq     mean   median       uq     max neval
# foverlaps(vec, df, mult = "first", nomatch = NULL) 4.255962 4.274029 4.304148 4.294534 4.329679 4.45406   100
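The calls above match on position only, while the question also asks for vec$id == df$id. The following is a minimal sketch of one way to add that condition. It uses a data.table non-equi update join instead of foverlaps, which is a swapped-in technique and not part of the answer above. The names df_dt and vec_dt and the five sample rows are made up for illustration.

```r
library(data.table)

# Intervals, as in the question (df_dt is an illustrative name)
df_dt <- data.table(id    = 1:3,
                    start = c(1, 1000, 16000),
                    end   = c(100, 1100, 16100),
                    info  = c("a", "b", "c"))

# A few hand-picked positions (illustrative data, not the OP's vec)
vec_dt <- data.table(id  = c(1L, 1L, 2L, 3L, 3L),
                     pos = c(50, 500, 1050, 16050, 20))

# For every row of df_dt, find the rows of vec_dt with the same id and
# start <= pos <= end, and write that row's info into vec_dt by reference.
# Rows of vec_dt without a match keep NA.
vec_dt[df_dt, on = .(id, pos >= start, pos <= end), info := i.info]

print(vec_dt)
```

Because the assignment is done by reference, vec_dt keeps its original row order and no copy of the 2 million rows is made.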

Answer 1: (score: 1)

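info <- rep(NA_character_, length(vec))
for (x in seq_len(nrow(df))) {
  # positions that fall inside the interval of row x
  i <- which(vec >= df$start[x] & vec <= df$end[x])
  info[i] <- as.character(df$info[x])
}

This loops over the rows of df rather than over vec, so that info holds the matching value for each position. The result goes into a separate vector because assigning the labels into vec itself would turn it into a character vector and break the comparisons in later iterations. As in the other answer, vec is treated as a plain numeric vector of positions.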
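If the intervals in df do not overlap each other, the loop can be avoided altogether in base R. This is a rough sketch under that assumption, using findInterval. The vector pos and the helper names df_sorted, idx and hit are illustrative and not from the original post.

```r
# Intervals from the question (strings kept as character)
df <- data.frame(start = c(1, 1000, 16000),
                 end   = c(100, 1100, 16100),
                 info  = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# A few hand-picked positions (illustrative data)
pos <- c(50, 500, 1050, 16100, 20000)

# findInterval() needs the break points sorted in increasing order
df_sorted <- df[order(df$start), ]

# Index of the last interval whose start is <= pos (0 if there is none)
idx <- findInterval(pos, df_sorted$start)
idx[idx == 0] <- NA

# A position matches only if it also lies at or below that interval's end
hit <- !is.na(idx) & pos <= df_sorted$end[idx]

info <- rep(NA_character_, length(pos))
info[hit] <- df_sorted$info[idx[hit]]

print(info)
```

findInterval does a binary search over the sorted starts, so the work grows with the length of pos times the logarithm of the number of intervals rather than with their product.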