我有两个数据框,我使用的是findInterval。井眼数据是井筒的x,y和z产生油的数据(VSS =垂直海底深度,md =测量深度a.k.a.钻头沿井下行进的实际距离)。 Perfs数据是井眼已被穿孔以允许流动的数据(top_perf = md,bot_perf = md)。
Perfs:
Well_ID top_perf bot_perf well_name surface ID x y VSS
056-W 2808 2958 056-W Ranger 2 0 0 0
056-W 3150 3250 056-W Ranger 1 0 0 0
056-W 3150 3250 056-W Ranger 2 0 0 0
056-W 3559 3664 056-W UT 1 1 0 0 0
056-W 3559 3664 056-W UT 2 2 0 0 0
057-W 2471 2952 057-W Tar 1 0 0 0
057-W 2471 2952 057-W Tar 2 0 0 0
058-W 2615 2896 058-W Ranger 1 0 0 0
058-W 2615 2896 058-W Ranger 2 0 0 0
井筒:
well_name well_id md vss x y
056-W 056-W 3260 -3251.46 4221436 4030454
056-W 056-W 3280 -3271.45 4221436 4030454
056-W 056-W 3300 -3291.45 4221435 4030453
056-W 056-W 3320 -3311.44 4221435 4030453
056-W 056-W 3340 -3331.44 4221434 4030453
056-W 056-W 3360 -3351.43 4221434 4030453
056-W 056-W 3380 -3371.43 4221433 4030453
056-W 056-W 3400 -3391.42 4221433 4030453
目标是找到与Wellbore $ md最接近的Perfs $ top_perf和Perfs $ bot_perf,其中Perfs $ Well_ID = Wellbore $ well_id然后从Wellbore中提取vss,x和y并将其添加到Perfs。 (我不在乎内插,如果它介于两者之间,只需要一些接近的东西。)
这是我的代码:
for(i in 1:dim(Perfs)[1]){
if(Perfs$ID[i] == 1){
Wellbore_temp <- Wellbore[which(Wellbore$well_id == Perfs[i,"Well_ID"]),]
interval <- findInterval(Perfs[i,"top_perf"], Wellbore_temp$md)
Perfs[i,c("x","y","VSS")] <- Wellbore_temp[interval, c("x","y","vss")]
}else{
Wellbore_temp <- Wellbore[which(Wellbore$well_id == Perfs[i,"Well_ID"]),]
interval <- findInterval(Perfs[i,"bot_perf"], Wellbore_temp$md)
Perfs[i,c("x","y","VSS")] <- Wellbore_temp[interval, c("x","y","vss")]
}
}
这段代码确实有用,它对于将要使用的应用程序来说太慢了。如何摆脱循环并以更加矢量化的方式执行此操作以加快速度?也欢迎findInterval以外的建议。
答案 0 :(得分:1)
在此处找到问题的答案:Join R data.tables where key values are not exactly equal--combine rows with closest times
基于@ ds440
提供的data.table的想法以下是我使用的代码,运行速度非常快:
Perf.Data <- Perfs
Wellbore.Perfs <- data.table(Wellbore[,c("well_id","md","vss")])
Spotfire.Top.Perf <- data.table(Perf.Data[,c("Well_ID","top_perf", "bot_perf")])
Spotfire.Bot.Perf <- data.table(Perf.Data[,c("Well_ID","bot_perf", "top_perf")])
#Change the column names to match up with Wellbore.Perfs
#Add in the bot_perf to .top.perf and the top_perf to the .bot.perf is done to make these unique and ensure everything is captured from the perfs table
colnames(Spotfire.Top.Perf) <- c("well_id","md", "bot_perf")
colnames(Spotfire.Bot.Perf) <- c("well_id","md","top_perf")
#set key to join on
setkey(Wellbore.Perfs, "well_id","md")
#roll = "nearest" will take the nearest value of md in .top.perf or .bot.perf and match it to the md in wellbore.perfs where Well_ID = Well_ID
Perfs.Wellbore.Top <- Wellbore.Perfs[Spotfire.Top.Perf, roll = "nearest"]
Perfs.Wellbore.Bot <- Wellbore.Perfs[Spotfire.Bot.Perf, roll = "nearest"]
答案 1 :(得分:0)
下面我介绍一个data.table解决方案。我只是在您展示的小数据子集上进行了测试,而在这个小数据集上,它的解决速度比您的解决方案慢,但我认为它可能会更好地扩展。如果没有,请考虑并行化。
如果您之前没有使用过data.table,我认为它通常很快,但语法可能有点复杂。 .SD
指的是连接到perfs数据的第i行的井眼数据的子集(迭代.EACHI
)。这节省了所有事物的巨大连接。我不是使用findInterval函数,而是计算错误(top_perf - md
或bot_perf - md
)并最小化绝对错误。这种方法优于滚动连接(“最近”)的优点是您可以看到错误是什么,并在必要时进行过滤。
library(data.table)
Perfs <- fread(input = 'Well_ID top_perf bot_perf well_name surface ID x y VSS
056-W 2808 2958 056-W Ranger 2 0 0 0
056-W 3150 3250 056-W Ranger 1 0 0 0
056-W 3150 3250 056-W Ranger 2 0 0 0
056-W 3559 3664 056-W UT_1 1 0 0 0
056-W 3559 3664 056-W UT_2 2 0 0 0
057-W 2471 2952 057-W Tar 1 0 0 0
057-W 2471 2952 057-W Tar 2 0 0 0
058-W 2615 2896 058-W Ranger 1 0 0 0
058-W 2615 2896 058-W Ranger 2 0 0 0')
Wellbore <- fread(input = 'well_name well_id md vss x y
056-W 056-W 3260 -3251.46 4221436 4030454
056-W 056-W 3280 -3271.45 4221436 4030454
056-W 056-W 3300 -3291.45 4221435 4030453
056-W 056-W 3320 -3311.44 4221435 4030453
056-W 056-W 3340 -3331.44 4221434 4030453
056-W 056-W 3360 -3351.43 4221434 4030453
056-W 056-W 3380 -3371.43 4221433 4030453
056-W 056-W 3400 -3391.42 4221433 4030453')
#top
setkey(Wellbore, 'well_id')
setkey(Perfs, 'Well_ID', 'top_perf')
top_matched <- Wellbore[unique(Perfs), .SD[which.min(abs(top_perf-md)),.(md, top_perf, err=top_perf-md, x,y,vss)],nomatch=0, by=.EACHI]
setkey(top_matched, 'well_id', 'top_perf')
top_joined <- top_matched[Perfs]
top_joined[,`:=`(i.x=NULL, i.y=NULL,VSS=NULL)]
setnames(top_joined, old=c('err', 'x', 'y', 'vss'), new=paste0('top_', c('err', 'x', 'y', 'vss')))
#bottom
setkey(Perfs, 'Well_ID', 'bot_perf')
bot_matched <- Wellbore[unique(Perfs), .SD[which.min(abs(bot_perf-md)),.(md, bot_perf, err=bot_perf-md, x,y,vss)],nomatch=0, by=.EACHI]
setkey(bot_matched, 'well_id', 'bot_perf')
bot_joined <- bot_matched[Perfs]
bot_joined[,`:=`(i.x=NULL, i.y=NULL,VSS=NULL)]
setnames(bot_joined, old=c('err', 'x', 'y', 'vss'), new=paste0('bot_', c('err', 'x', 'y', 'vss')))
answer <- cbind(top_joined[,c(1:2,9:11,3:7), with=F], bot_joined[,3:7,with=F])
# well_id md well_name surface ID top_perf top_err top_x top_y top_vss bot_perf bot_err
# 1: 056-W 3260 056-W Ranger 2 2808 -452 4221436 4030454 -3251.46 2958 -302
# 2: 056-W 3260 056-W Ranger 1 3150 -110 4221436 4030454 -3251.46 3250 -10
# 3: 056-W 3260 056-W Ranger 2 3150 -110 4221436 4030454 -3251.46 3250 -10
# 4: 056-W 3400 056-W UT_1 1 3559 159 4221433 4030453 -3391.42 3664 264
# 5: 056-W 3400 056-W UT_2 2 3559 159 4221433 4030453 -3391.42 3664 264
# 6: 057-W NA 057-W Tar 1 2471 NA NA NA NA 2952 NA
# 7: 057-W NA 057-W Tar 2 2471 NA NA NA NA 2952 NA
# 8: 058-W NA 058-W Ranger 1 2615 NA NA NA NA 2896 NA
# 9: 058-W NA 058-W Ranger 2 2615 NA NA NA NA 2896 NA
# bot_x bot_y bot_vss
# 1: 4221436 4030454 -3251.46
# 2: 4221436 4030454 -3251.46
# 3: 4221436 4030454 -3251.46
# 4: 4221433 4030453 -3391.42
# 5: 4221433 4030453 -3391.42
# 6: NA NA NA
# 7: NA NA NA
# 8: NA NA NA
# 9: NA NA NA