Include information about overlapping intervals in a data.frame

Time: 2015-09-03 17:47:12

Tags: r

I have two data frames:

dfA
    ID  from    to  Lith
1   BG1 0       0.5 SED
2   BG1 0.5     0.6 GDI
3   BG1 0.6     2.8 GRN
4   ZH4 0       0.7 GRN
5   ZH4 0.7     3.0 GDI

dfB
    ID  from    to  Weath
1   BG1 0       0.8 HW
2   BG1 0.8     1.5 SW
3   BG1 1.5     2.6 HW
4   ZH4 0       0.3 HW
5   ZH4 0.3     2.6 SW

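I want the percentage of each dfB interval (from/to) that is overlapped by each 'Lith' class from dfA, expressed as a fraction. The result should look like this: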

dfC
    ID  from    to  Weath   GRN     GDI     SED
1   BG1 0       0.8 HW      0.25    0.125   0.625
2   BG1 0.8     1.5 SW      1       0       0
3   BG1 1.5     2.6 HW      1       0       0
4   ZH4 0       0.3 HW      1       0       0
5   ZH4 0.3     2.6 SW      0.1739  0.8261  0
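
To make the expected numbers concrete, here is a minimal base-R sketch that rebuilds the two tables and checks the first row of dfC by hand. The data.frame() calls simply transcribe the tables above; the names lo, hi, a and frac are illustrative only and not part of the original question.

```r
# Rebuild the example tables from the question
dfA <- data.frame(
  ID   = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
  from = c(0, 0.5, 0.6, 0, 0.7),
  to   = c(0.5, 0.6, 2.8, 0.7, 3.0),
  Lith = c("SED", "GDI", "GRN", "GRN", "GDI"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  ID    = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
  from  = c(0, 0.8, 1.5, 0, 0.3),
  to    = c(0.8, 1.5, 2.6, 0.3, 2.6),
  Weath = c("HW", "SW", "HW", "HW", "SW"),
  stringsAsFactors = FALSE
)

# First dfB interval: BG1, from 0 to 0.8
lo <- dfB$from[1]
hi <- dfB$to[1]
a  <- dfA[dfA$ID == dfB$ID[1], ]  # dfA rows with the same ID

# Length of each dfA interval's intersection with [lo, hi], clipped at zero,
# divided by the length of the dfB interval
frac <- pmax(0, pmin(a$to, hi) - pmax(a$from, lo)) / (hi - lo)
names(frac) <- a$Lith
print(round(frac, 4))
```

SED covers 0.5 of the 0.8-long interval (0.625), GDI covers 0.1 (0.125) and GRN covers 0.2 (0.25), which matches the first row of dfC.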

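Note that the intervals in dfA do not line up with those in dfB, and that overlaps should only be checked within the same ID. Also note that a single dfB interval can have up to three overlaps. The range covered by dfA is always larger than that covered by dfB.

My attempts so far have led to a dead end. Splitting the data frame by ID is not an option because the original data set is huge.

3 answers:

Answer 0 (score: 3)

Here is a possible foverlaps solution: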

library(data.table)
setkey(setDT(dfA), ID, from, to)
setkey(setDT(dfB), ID, from, to)
res <- foverlaps(dfA, dfB)[, overlap := (pmin(to, i.to) - pmax(from, i.from))/(to - from)]
dcast(res, ID + from + to + Weath ~ Lith, value.var = "overlap", fill = 0)
#     ID from  to Weath      GDI      GRN   SED
# 1: BG1  0.0 0.8    HW 0.125000 0.250000 0.625
# 2: BG1  0.8 1.5    SW 0.000000 1.000000 0.000
# 3: BG1  1.5 2.6    HW 0.000000 1.000000 0.000
# 4: ZH4  0.0 0.3    HW 0.000000 1.000000 0.000
# 5: ZH4  0.3 2.6    SW 0.826087 0.173913 0.000
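  • Key both tables by ID and the interval columns (necessary so that foverlaps knows which columns to operate on)
  • Run the foverlaps function to identify the overlaps
  • Define the overlap variable according to your rule
  • Finally, dcast on the columns of interest to get the result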
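
The steps above can be inspected one at a time. The following is a minimal sketch, assuming the data.table package is installed; it rebuilds the question's tables, stops before the dcast step, and adds a sanity check that is not part of the original answer. The name chk is illustrative only.

```r
library(data.table)

# Example tables transcribed from the question
dfA <- data.table(
  ID   = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
  from = c(0, 0.5, 0.6, 0, 0.7),
  to   = c(0.5, 0.6, 2.8, 0.7, 3.0),
  Lith = c("SED", "GDI", "GRN", "GRN", "GDI")
)
dfB <- data.table(
  ID    = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
  from  = c(0, 0.8, 1.5, 0, 0.3),
  to    = c(0.8, 1.5, 2.6, 0.3, 2.6),
  Weath = c("HW", "SW", "HW", "HW", "SW")
)

# Key both tables: ID first, then the interval start and end columns
setkey(dfA, ID, from, to)
setkey(dfB, ID, from, to)

# One row per overlapping (dfB interval, dfA interval) pair;
# from/to come from dfB, i.from/i.to come from dfA
res <- foverlaps(dfA, dfB)
res[, overlap := (pmin(to, i.to) - pmax(from, i.from)) / (to - from)]
print(res)

# Sanity check: the fractions within each dfB interval sum to 1
# when dfA fully covers that interval, as it does in this example
chk <- res[, .(total = sum(overlap)), by = .(ID, from, to)]
print(chk)
```

If a total in chk came out below 1, part of that dfB interval would have no matching dfA interval for the same ID.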

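Answer 1 (score: 1)

I would process each value of Lith (GRN, GDI, SED) one at a time, adding the resulting column to dfC. For each value of Lith, I would first use the match function to find the row of dfA that corresponds to each row of dfB (this is the vector of row indices r in the get.col function below). I would then use pmax and pmin to compute the normalized overlap in a vectorized way (which matters, since you say you have a large data set).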

get.col <- function(lith) {
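  # Row of dfA matching each dfB row's ID and this lith (NA if none).
  # match() returns only the first hit, so each (ID, Lith) pair is assumed
  # to occur at most once in dfA.
  r <- match(paste(dfB$ID, lith), paste(dfA$ID, dfA$Lith))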
  out <- pmax(0, pmin(dfA$to[r], dfB$to) - pmax(dfA$from[r], dfB$from)) /  # Overlap
    (dfB$to - dfB$from)  # Size of interval in dfB
  out[is.na(out)] <- 0  # Unmatched rows have no overlap
  out
}

dfC <- dfB
for (lith in unique(dfA$Lith)) {
  dfC[,lith] <- get.col(lith)
}
dfC
#    ID from  to Weath   SED      GDI      GRN
# 1 BG1  0.0 0.8    HW 0.625 0.125000 0.250000
# 2 BG1  0.8 1.5    SW 0.000 0.000000 1.000000
# 3 BG1  1.5 2.6    HW 0.000 0.000000 1.000000
# 4 ZH4  0.0 0.3    HW 0.000 0.000000 1.000000
# 5 ZH4  0.3 2.6    SW 0.000 0.826087 0.173913
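
To see what the index vector r contains, here is a small base-R sketch for a single value of Lith. It rebuilds the question's tables; the names keyA and r_sed are illustrative only. It also checks an assumption this approach relies on, namely that no (ID, Lith) pair is repeated in dfA, because match() reports only the first matching row.

```r
# Example tables transcribed from the question
dfA <- data.frame(
  ID   = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
  from = c(0, 0.5, 0.6, 0, 0.7),
  to   = c(0.5, 0.6, 2.8, 0.7, 3.0),
  Lith = c("SED", "GDI", "GRN", "GRN", "GDI"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  ID    = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
  from  = c(0, 0.8, 1.5, 0, 0.3),
  to    = c(0.8, 1.5, 2.6, 0.3, 2.6),
  Weath = c("HW", "SW", "HW", "HW", "SW"),
  stringsAsFactors = FALSE
)

# Combined "ID Lith" key for every dfA row
keyA <- paste(dfA$ID, dfA$Lith)

# For each dfB row, the dfA row holding SED for the same ID (NA if none)
r_sed <- match(paste(dfB$ID, "SED"), keyA)
print(r_sed)

# anyDuplicated() returns 0 when every (ID, Lith) pair is unique;
# a nonzero value would mean later duplicates are ignored by match()
print(anyDuplicated(keyA))
```

Here the three BG1 rows of dfB all point at dfA row 1, and the two ZH4 rows get NA because ZH4 has no SED interval. Those NA values are what get.col later replaces with 0.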

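Answer 2 (score: 1)

Merge the tables, apply an overlap function, and reshape as needed.

library(reshape2)
# Pair every dfB interval with every dfA interval that shares its ID
m <- merge(dfB, dfA, by = "ID", suffixes = c("", ".y"))
# Length of the intersection of [L1, R1] and [L2, R2], zero if disjoint
overlap <- function(L1, R1, L2, R2) pmax(0, pmin(R1, R2) - pmax(L1, L2))
# Fraction of each dfB interval covered by the paired dfA interval
m$value <- overlap(m$from, m$to, m$from.y, m$to.y) / (m$to - m$from)
# One column per Lith; ID/Lith combinations absent from dfA come out as NA
dcast(m, ID + from + to + Weath ~ Lith, value.var = "value")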

#>    ID from  to Weath      GDI      GRN   SED
#> 1 BG1  0.0 0.8    HW 0.125000 0.250000 0.625
#> 2 BG1  0.8 1.5    SW 0.000000 1.000000 0.000
#> 3 BG1  1.5 2.6    HW 0.000000 1.000000 0.000
#> 4 ZH4  0.0 0.3    HW 0.000000 1.000000    NA
#> 5 ZH4  0.3 2.6    SW 0.826087 0.173913    NA
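
The NA values in the SED column above appear because dfA has no SED interval for ZH4, so those cells have no row in the merged table. One way to get zeros instead, as in the result the question asks for, is to pass fill = 0 to dcast. The following is a minimal sketch of that variant, assuming reshape2 is installed; the data.frame() calls transcribe the question's tables and out is an illustrative name.

```r
library(reshape2)

# Example tables transcribed from the question
dfA <- data.frame(
  ID   = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
  from = c(0, 0.5, 0.6, 0, 0.7),
  to   = c(0.5, 0.6, 2.8, 0.7, 3.0),
  Lith = c("SED", "GDI", "GRN", "GRN", "GDI"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  ID    = c("BG1", "BG1", "BG1", "ZH4", "ZH4"),
  from  = c(0, 0.8, 1.5, 0, 0.3),
  to    = c(0.8, 1.5, 2.6, 0.3, 2.6),
  Weath = c("HW", "SW", "HW", "HW", "SW"),
  stringsAsFactors = FALSE
)

# Pair every dfB interval with every dfA interval of the same ID
m <- merge(dfB, dfA, by = "ID", suffixes = c("", ".y"))

# Overlap length, clipped at zero, as a fraction of the dfB interval
m$value <- pmax(0, pmin(m$to, m$to.y) - pmax(m$from, m$from.y)) /
  (m$to - m$from)

# fill = 0 replaces cells with no matching ID/Lith combination
out <- dcast(m, ID + from + to + Weath ~ Lith,
             value.var = "value", fill = 0)
print(out)
```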