Question

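In this question I need to look up values for the columns of a data frame based not on a single attribute but on several attributes and a range, compared against a dictionary. (Yes, this is actually a continuation of the story from "R conditional replace more columns by lookup".)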

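This should be an easy question for people who know R, since I am providing a working solution for the basic indexing that only needs an upgrade, probably an easy one... but it is hard for me because I am still learning R.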

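Where to start:

Based on the default column of a (small) dictionary, testdefs, I want to replace the missing values in the columns testcolnames of a (big) table, df1. The dictionary row is selected by testdefs$LABMET_ID being equal to the column name in testcolnames. I use this code: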

testcolnames=c("80","116") #...result of regexp on colnames(df1), originally much longer

df1[,testcolnames] <- lapply(testcolnames, function(x) { tmpcol<-df1[,x];
  tmpcol[is.na(tmpcol)] <- testdefs$default[match(x, testdefs$LABMET_ID)];
  tmpcol  })

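Where to go:

Now I need to upgrade this solution. The table testdefs will have (example below) multiple rows for the same LABMET_ID, differing only in two new columns named lower and upper. These need to act as bounds for the variable df1$rngvalue when selecting the replacement value.

In other words, the upgraded solution should not only select the rows from testdefs where testdefs$LABMET_ID equals the column name, but also pick, among those rows, the one where df1$rngvalue falls between testdefs$lower and testdefs$upper. If no such range exists, take the nearest range (the lowest or the highest); if the dictionary has no matching LABMET_ID, the NA can stay in the original data.

An example: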

testdefs:

"LABMET_ID","lower","upper","default","notuse","notuse2"
30,0,54750,25,80,2      #..."many columns we dont care about"
46,0,54750,1.45,3.5,0.2
80,0,54750,0.03,0.1,0.01
116,0,30,0.09,0.5,0.01
116,31,365,0.135,0.7,0.01
116,366,5475,0.11,0.7,0.01
116,5476,54750,0.105,0.7,0.02

df1:

"rngvalue","80","116"
36,NA,NA
600000,NA,NA
367,5,NA
90,NA,6

Becomes:

"rngvalue","80","116"
36,0.03,0.135       #col80 is always replaced by 0.03
600000,0.03,0.105   #col116 needs to be decided on range, this value is bigger than everything in dictionary so take the last one
367,5,0.11          #5 not replaced, but second column nicely looks up to 0.11
90,0.03,6           #6 not replaced

Answer 1

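Since there are no gaps in your intervals, you can use findInterval. I would use dlply from plyr to change the lookup table into a list containing the breaks and defaults for each LABMET_ID.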

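## Transform lookup table to a list with breaks for intervals
library(plyr)
lookup <- dlply(testdefs, .(LABMET_ID), function(x)
    # Interleave lower/upper, append the last upper bound, then keep every
    # other element: all lower bounds followed by the final upper bound
    list(breaks=c(rbind(x$lower, x$upper), x$upper[length(x$upper)])[c(TRUE,FALSE)],
         default=x$default))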

So the lookup now looks like

lookup[["116"]]
# $breaks
# [1]     0    31   366  5476 54750
# 
# $default
# [1] 0.090 0.135 0.110 0.105
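
For readers who would rather avoid the plyr dependency, the same list can probably be built with base R's split and lapply. This is only a sketch, not part of the original answer: the name lookup_base is hypothetical, the sample testdefs is cut down to the columns that matter, and it assumes the intervals for each LABMET_ID are contiguous once sorted by lower.

```r
# Cut-down copy of the question's dictionary (only the columns used here)
testdefs <- data.frame(
  LABMET_ID = c(80, 116, 116, 116, 116),
  lower     = c(0, 0, 31, 366, 5476),
  upper     = c(54750, 30, 365, 5475, 54750),
  default   = c(0.03, 0.09, 0.135, 0.11, 0.105)
)

# One list element per LABMET_ID: all lower bounds plus the final upper bound
lookup_base <- lapply(split(testdefs, testdefs$LABMET_ID), function(x) {
  x <- x[order(x$lower), ]          # make sure intervals are in ascending order
  list(breaks  = c(x$lower, x$upper[nrow(x)]),
       default = x$default)
})

lookup_base[["116"]]$breaks
# [1]     0    31   366  5476 54750
```

The elements are named by LABMET_ID ("80", "116"), so lookup_base[[x]] can be indexed with a column name in the same way as the dlply result.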

Then you can do the lookup with the following

testcolnames=c("80","116")

df1[,testcolnames] <- lapply(testcolnames, function(x) {
    tmpcol <- df1[,x]
    defaults <- with(lookup[[x]], {
        default[pmax(pmin(length(breaks)-1, findInterval(df1$rngvalue, breaks)), 1)]
    })
    tmpcol[is.na(tmpcol)] <- defaults[is.na(tmpcol)]
    tmpcol
})

#   rngvalue   80   116
# 1       36 0.03 0.135
# 2   600000 0.03 0.105
# 3      367 5.00 0.110
# 4       90 0.03 6.000

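If rngvalue falls outside the range, findInterval returns 0 for values below the first break and length(breaks) for values at or above the last break. That is the reason for the pmin and pmax in the code above: they clamp the index so that it always points at a valid element of default.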
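
To see the clamping in isolation, here is a small sketch using the breaks and defaults for LABMET_ID 116. The variable names and the out-of-range value -5 are made up for illustration and do not come from the question's data.

```r
breaks   <- c(0, 31, 366, 5476, 54750)
defaults <- c(0.09, 0.135, 0.11, 0.105)

# -5 is below every break, 600000 is above every break (illustrative values)
rng <- c(-5, 36, 367, 600000)

raw <- findInterval(rng, breaks)
raw
# [1] 0 2 3 5

# Clamp to the valid index range 1 .. length(breaks) - 1
clamped <- pmax(pmin(length(breaks) - 1, raw), 1)
clamped
# [1] 1 2 3 4

defaults[clamped]
# [1] 0.090 0.135 0.110 0.105
```

Without the clamp, index 0 would silently drop an element and index 5 would yield NA, so the replacement vector would no longer line up with the rows of df1.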
