I have a data.frame and a vector like this:
df = data.frame(id = 1:3,
start = c(1, 1000, 16000),
end = c(100, 1100, 16100),
info = c("a", "b", "c"))
vec = cbind(id= 1:150, pos=c(sample(1:100, 50),
sample(1000:1100, 50),
sample(1600:16100, 50)))
For each value in vec, I want to find the corresponding row in df where:
vec$pos >= df$start
vec$pos <= df$end
vec$id == df$id
so that I can extract the info column.
The problem is that df has 1000 rows and vec has 2 million values, so looping over vec with sapply is slow. Can anyone do this by looping over df instead?
Answer 0 (score: 3)
You can turn each value of vec into an interval (with start equal to end) and use data.table::foverlaps.
library(data.table)
# Make df a data.table and set key
setDT(df)
setkey(df, start, end)
# Turn vector into a data.table with start and end
vec <- data.table(start = vec, end = vec)
setkey(vec, start, end)
# Apply overlaps for each vec entry
# This will get only those vec entries that overlap with df
foverlaps(vec, df, nomatch = NULL)
# Or if you want only info and vec column use:
foverlaps(vec, df, mult = "first", nomatch = NULL)[, .(info, vec = i.start)]
I tested it on dummy data (same dimensions as the OP's), and it takes a few seconds.
df <- data.table(start = sample(1:1e7, 1e3),
info = sample(letters, 1e3, replace = TRUE))
df$end <- df$start + 10
setkey(df, start, end)
vec <- sample(2e6)
vec <- data.table(start = vec, end = vec)
setkey(vec, start, end)
microbenchmark::microbenchmark(
foverlaps(vec, df, mult = "first", nomatch = NULL)
)
# Unit: seconds
# expr min lq mean median uq max neval
# foverlaps(vec, df, mult = "first", nomatch = NULL) 4.255962 4.274029 4.304148 4.294534 4.329679 4.45406 100
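The calls above match on position only. Here is a minimal sketch of how the id condition from the question might be included as well. It uses a data.table non-equi join instead of foverlaps, which is a swapped-in technique and not part of this answer. The small vec table and its values are made up for illustration.

```r
library(data.table)

# Intervals, as in the question
df <- data.table(id = 1:3,
                 start = c(1, 1000, 16000),
                 end = c(100, 1100, 16100),
                 info = c("a", "b", "c"))

# Hypothetical positions: the last row has a pos that falls in the
# interval of id 2 but carries id 1, so it should not match
vec <- data.table(id = c(1L, 2L, 3L, 1L),
                  pos = c(50, 1050, 16050, 1050))

# Non-equi update join: for each row of df, find the rows of vec with
# the same id and start <= pos <= end, and copy info over by reference
vec[df, on = .(id, pos >= start, pos <= end), info := i.info]

print(vec)
```

Rows of vec without a matching interval keep NA in info. Since the update happens by reference, no second copy of the 2 million rows is created.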
Answer 1 (score: 1)
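# vec is assumed to be a plain numeric vector of positions
res <- rep(NA_character_, length(vec))
invisible(sapply(seq_len(nrow(df)), function(x) {
  i <- which(vec >= df$start[x] & vec <= df$end[x])
  res[i] <<- as.character(df$info[x])
}))
This loops over the rows of df and fills res with the matching info for each position in vec (NA where no interval matches). The result goes into a separate vector because overwriting vec itself would turn it into a character vector after the first assignment and break the later numeric comparisons.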
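If the intervals in df do not overlap, the same lookup could probably be done without any explicit loop using base R's findInterval. The following is only a sketch under that non-overlap assumption, and the values in vec are made up for illustration.

```r
# Intervals as in the question (assumed non-overlapping)
df <- data.frame(start = c(1, 1000, 16000),
                 end = c(100, 1100, 16100),
                 info = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Hypothetical positions, some inside and some outside the intervals
vec <- c(50, 100, 500, 1000, 16100, 20000, 0)

# findInterval needs the break points in increasing order
ord <- order(df$start)
starts <- df$start[ord]
ends <- df$end[ord]
infos <- df$info[ord]

# Index of the last interval whose start is <= pos (0 if there is none)
idx <- findInterval(vec, starts)
idx[idx == 0] <- NA

# A position is a hit only if it also lies at or below that interval's end
hit <- !is.na(idx) & vec <= ends[idx]

res <- infos[idx]
res[!hit] <- NA
print(res)
```

findInterval uses a binary search over the sorted starts, so the cost grows with the length of vec times the logarithm of the number of intervals, rather than with the product of the two sizes.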