I have two data frames:
dfA
   ID from  to Lith
1 BG1    0 0.5  SED
2 BG1  0.5 0.6  GDI
3 BG1  0.6 2.8  GRN
4 ZH4    0 0.7  GRN
5 ZH4  0.7 3.0  GDI
dfB
   ID from  to Weath
1 BG1    0 0.8    HW
2 BG1  0.8 1.5    SW
3 BG1  1.5 2.6    HW
4 ZH4    0 0.3    HW
5 ZH4  0.3 2.6    SW
For each interval in dfB, I want the percentage of overlap (by 'from' and 'to') with each 'Lith' in dfA. The result should look like this:
dfC
   ID from  to Weath    GRN    GDI   SED
1 BG1    0 0.8    HW   0.25  0.125 0.625
2 BG1  0.8 1.5    SW      1      0     0
3 BG1  1.5 2.6    HW      1      0     0
4 ZH4    0 0.3    HW      1      0     0
5 ZH4  0.3 2.6    SW 0.1739 0.8261     0
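Note that the intervals in dfA do not line up with the intervals in dfB, and that overlap should only be checked within the same ID. Note also that at most three overlaps are possible within one dfB interval. The interval covered by dfA is always larger than that of dfB.
My attempts so far have led to dead ends. Splitting the data frames by ID is not an option, because the original data volume is huge.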
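To try the answers below, the two example tables can be recreated as ordinary data frames. This is only a convenience sketch built from the values printed in the question; `stringsAsFactors = FALSE` is an assumption so that `ID` and `Lith` stay character on older R versions.

```r
# Recreate the example data from the question (base R only).
dfA <- data.frame(
  ID   = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
  from = c(0, 0.5, 0.6, 0, 0.7),
  to   = c(0.5, 0.6, 2.8, 0.7, 3.0),
  Lith = c("SED", "GDI", "GRN", "GRN", "GDI"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  ID    = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
  from  = c(0, 0.8, 1.5, 0, 0.3),
  to    = c(0.8, 1.5, 2.6, 0.3, 2.6),
  Weath = c("HW", "SW", "HW", "HW", "SW"),
  stringsAsFactors = FALSE
)
```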
Answer 0 (score: 3)
Here is a possible foverlaps solution:
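library(data.table)
# Key both tables by ID and the interval columns so foverlaps() knows what to join on
setkey(setDT(dfA), ID, from, to)
setkey(setDT(dfB), ID, from, to)
# from/to come from dfB, i.from/i.to from dfA; overlap length is divided by the dfB interval length
res <- foverlaps(dfA, dfB)[, overlap := (pmin(to, i.to) - pmax(from, i.from)) / (to - from)]
# One column per Lith; combinations that never overlap are filled with 0
dcast(res, ID + from + to + Weath ~ Lith, value.var = "overlap", fill = 0)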
#     ID from  to Weath      GDI      GRN   SED
# 1: BG1  0.0 0.8    HW 0.125000 0.250000 0.625
# 2: BG1  0.8 1.5    SW 0.000000 1.000000 0.000
# 3: BG1  1.5 2.6    HW 0.000000 1.000000 0.000
# 4: ZH4  0.0 0.3    HW 0.000000 1.000000 0.000
# 5: ZH4  0.3 2.6    SW 0.826087 0.173913 0.000
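To make the overlap expression in the foverlaps call concrete, here is a small base R sketch that applies the same arithmetic by hand to the first dfB interval (BG1, 0 to 0.8) and the three BG1 rows of dfA. The variable names are only illustrative; the numbers are taken from the question.

```r
# First dfB interval (BG1, 0 to 0.8) against the three BG1 rows of dfA.
b_from <- 0
b_to   <- 0.8
a_from <- c(0, 0.5, 0.6)
a_to   <- c(0.5, 0.6, 2.8)
lith   <- c("SED", "GDI", "GRN")

# Overlap length = smaller right end minus larger left end,
# then divided by the length of the dfB interval.
overlap <- (pmin(a_to, b_to) - pmax(a_from, b_from)) / (b_to - b_from)
setNames(round(overlap, 3), lith)
#   SED   GDI   GRN
# 0.625 0.125 0.250
```

These match the first row of the result: 0.5 of the 0.8 units are SED, 0.1 are GDI and 0.2 are GRN.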
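The steps are:
- Set the key on ID and the interval columns (necessary for foverlaps to know which columns to operate on)
- Run the foverlaps function to identify the overlaps
- Compute the overlap variable
- dcast according to the columns of interest

Answer 1 (score: 1)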
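I would process each value of Lith (GRN, GDI, SED) one at a time, adding the resulting column to dfC. For each value of Lith, I would first use the match function to find the row of dfA that corresponds to each row of dfB (this is the vector of row indices r in the get.col function below). Then I would use pmax and pmin to compute the normalized overlap in a vectorized way (which matters because you say you have a large dataset).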
get.col <- function(lith) {
  r <- match(paste(dfB$ID, lith), paste(dfA$ID, dfA$Lith))
  out <- pmax(0, pmin(dfA$to[r], dfB$to) - pmax(dfA$from[r], dfB$from)) /  # Overlap
    (dfB$to - dfB$from)                                                    # Size of interval in dfB
  out[is.na(out)] <- 0  # Unmatched rows have no overlap
  out
}
dfC <- dfB
for (lith in unique(dfA$Lith)) {
  dfC[, lith] <- get.col(lith)
}
dfC
#    ID from  to Weath   SED      GDI      GRN
# 1 BG1  0.0 0.8    HW 0.625 0.125000 0.250000
# 2 BG1  0.8 1.5    SW 0.000 0.000000 1.000000
# 3 BG1  1.5 2.6    HW 0.000 0.000000 1.000000
# 4 ZH4  0.0 0.3    HW 0.000 0.000000 1.000000
# 5 ZH4  0.3 2.6    SW 0.000 0.826087 0.173913
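One thing worth checking before using get.col on real data: match returns only the position of the first match. In the example each Lith value occurs at most once per ID, so this is fine, but if the same Lith appeared in two separate intervals of one ID, only the first would be counted. The sketch below uses a made-up key vector (not from the question) to show the behaviour.

```r
# Hypothetical keys in which "BG1 GRN" occurs twice (not the question's data).
keys_A <- c("BG1 GRN", "BG1 GDI", "BG1 GRN")

# match() reports only the first position...
first_hit <- match("BG1 GRN", keys_A)
first_hit
# [1] 1

# ...whereas which() lists every position.
all_hits <- which(keys_A == "BG1 GRN")
all_hits
# [1] 1 3
```

If repeated lithologies are possible in your data, a join-based approach that sums the overlaps per Lith may be the safer choice.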
Answer 2 (score: 1)
Merge the tables, apply an overlap function, and reshape as needed.
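library(reshape2)
# Pair every dfB interval with every dfA interval of the same ID
m <- merge(dfB, dfA, by = "ID", suffixes = c("", ".y"))
# Overlap length of [L1, R1] and [L2, R2], clamped at 0 for disjoint intervals
overlap <- function(L1, R1, L2, R2) pmax(0, pmin(R1, R2) - pmax(L1, L2))
m$value <- overlap(m$from, m$to, m$from.y, m$to.y) / (m$to - m$from)
# fill = 0 so that a Lith absent from an ID (SED in ZH4) shows 0 rather than NA
dcast(m, ID + from + to + Weath ~ Lith, value.var = "value", fill = 0)
#>    ID from  to Weath      GDI      GRN   SED
#> 1 BG1  0.0 0.8    HW 0.125000 0.250000 0.625
#> 2 BG1  0.8 1.5    SW 0.000000 1.000000 0.000
#> 3 BG1  1.5 2.6    HW 0.000000 1.000000 0.000
#> 4 ZH4  0.0 0.3    HW 0.000000 1.000000 0.000
#> 5 ZH4  0.3 2.6    SW 0.826087 0.173913 0.000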
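As a sanity check on the merge approach, the fractions for each dfB interval should add up to 1 whenever dfA fully covers that interval, which the question says is the case. The sketch below is a dependency-free variant, not the answer's own code: it swaps dcast for base R's xtabs, which sums the values per cell and puts 0 in empty cells. The `interval` column is a hypothetical helper label introduced only for this illustration.

```r
# Example data from the question
dfA <- data.frame(ID = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
                  from = c(0, 0.5, 0.6, 0, 0.7),
                  to = c(0.5, 0.6, 2.8, 0.7, 3.0),
                  Lith = c("SED", "GDI", "GRN", "GRN", "GDI"))
dfB <- data.frame(ID = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
                  from = c(0, 0.8, 1.5, 0, 0.3),
                  to = c(0.8, 1.5, 2.6, 0.3, 2.6),
                  Weath = c("HW", "SW", "HW", "HW", "SW"))

# Same join and overlap fraction as in the answer
m <- merge(dfB, dfA, by = "ID", suffixes = c("", ".y"))
m$value <- pmax(0, pmin(m$to, m$to.y) - pmax(m$from, m$from.y)) / (m$to - m$from)

# 'interval' is a hypothetical helper label identifying each dfB row
m$interval <- paste(m$ID, m$from, m$to)

# xtabs() sums 'value' per (interval, Lith) cell; empty cells come out as 0
tab <- xtabs(value ~ interval + Lith, data = m)

# Each dfB interval should be fully accounted for if dfA covers it
covered <- rowSums(tab)
stopifnot(isTRUE(all.equal(unname(covered), rep(1, nrow(dfB)))))
```

If the stopifnot line fails on real data, some dfB interval extends beyond what dfA covers for that ID, or intervals in dfA overlap one another.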