R: replace by lookup in a dictionary

Time: 2015-07-22 13:39:11

Tags: r dataframe lookup na

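In this question I need to look up values for columns of a data frame based not on a single attribute, but on several attributes, including a range, compared against a dictionary. (Yes, this is actually a continuation of the story from "R conditional replace more columns by lookup".)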

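This should be an easy question for people who know R, because I provide a working solution based on basic indexing that only needs to be upgraded, probably easily... but it is hard for me, since I am still learning R.

Where I start from:

Using the default column of a (small) dictionary testdefs, I want to replace the missing values in the columns testcolnames of a (big) table df1. The dictionary row is selected by testdefs$LABMET_ID being equal to the column name in testcolnames. I use this code: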

testcolnames <- c("80", "116")  # ...result of regexp on colnames(df1), originally much longer

df1[, testcolnames] <- lapply(testcolnames, function(x) {
  tmpcol <- df1[, x]
  # fill NAs with the default of the dictionary row whose LABMET_ID matches the column name
  tmpcol[is.na(tmpcol)] <- testdefs$default[match(x, testdefs$LABMET_ID)]
  tmpcol
})

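Where I want to go:

Now I need to upgrade this solution. The table testdefs will have (example below) multiple rows for the same LABMET_ID, differing only in two new columns named lower and upper. These act as bounds for the variable df1$rngvalue when choosing the replacement value.

In other words: the upgraded solution should not only select the rows of testdefs where testdefs$LABMET_ID equals the column name, but from those rows pick the one where df1$rngvalue lies between testdefs$lower and testdefs$upper. If no such row exists, take the nearest range (the lowest or the highest). If the dictionary does not contain the LABMET_ID at all, we can leave the NA in the original data.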

An example:

testdefs:

"LABMET_ID","lower","upper","default","notuse","notuse2"
30,0,54750,25,80,2            #..."many columns we dont care about"
46,0,54750,1.45,3.5,0.2
80,0,54750,0.03,0.1,0.01
116,0,30,0.09,0.5,0.01
116,31,365,0.135,0.7,0.01
116,366,5475,0.11,0.7,0.01
116,5476,54750,0.105,0.7,0.02

df1:

"rngvalue","80","116"
36,NA,NA
600000,NA,NA
367,5,NA
90,NA,6

is transformed into:

"rngvalue","80","116"
36,0.03,0.135                   #col80 is always replaced by 0.03
600000,0.03,0.105               #col116 needs to be decided on range, this value is bigger than everything in dictionary so take the last one
367,5,0.11                      #5 not replaced, but second column nicely looks up to 0.11
90,0.03,6                       #6 not replaced
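
To make the example reproducible, the two tables can be loaded from the CSV text shown above. This is only a minimal sketch: reading with read.csv and setting check.names = FALSE (so that the column names stay "80" and "116" instead of becoming "X80" and "X116") are assumptions, since the question does not say how the data were loaded.

```r
# Dictionary: one or more rows per LABMET_ID, each with a lower/upper range
testdefs <- read.csv(text = '"LABMET_ID","lower","upper","default","notuse","notuse2"
30,0,54750,25,80,2
46,0,54750,1.45,3.5,0.2
80,0,54750,0.03,0.1,0.01
116,0,30,0.09,0.5,0.01
116,31,365,0.135,0.7,0.01
116,366,5475,0.11,0.7,0.01
116,5476,54750,0.105,0.7,0.02')

# Data: check.names = FALSE (an assumption) keeps the numeric column names as they are
df1 <- read.csv(text = '"rngvalue","80","116"
36,NA,NA
600000,NA,NA
367,5,NA
90,NA,6', check.names = FALSE)

str(testdefs)
str(df1)
```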

1 answer:

Answer 0 (score: 2)

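Since there are no gaps between the intervals, you can use findInterval. I would use dlply from plyr to turn the lookup table into a list that holds, for each LABMET_ID, the interval breaks and the default values.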
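
As a quick illustration of what findInterval does, here is a small sketch using the lower bounds of LABMET_ID 116 from the question as breaks (the variable names are only for illustration). For each value it returns the index of the interval that contains it, which can then be used to index the vector of defaults.

```r
# Breaks: the lower bounds of each range, plus the upper bound of the last range
breaks <- c(0, 31, 366, 5476, 54750)

# Index of the interval [breaks[i], breaks[i + 1]) that contains each value
idx <- findInterval(c(36, 367, 90), breaks)
idx
# [1] 2 3 2
```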

## Transform lookup table to a list with breaks for intervals
library(plyr)
lookup <- dlply(testdefs, .(LABMET_ID), function(x)
    # breaks: every lower bound, followed by the upper bound of the last range
    list(breaks = c(x$lower, x$upper[length(x$upper)]),
         default = x$default))

So the lookup now looks like this:

lookup[["116"]]
# $breaks
# [1]     0    31   366  5476 54750
# 
# $default
# [1] 0.090 0.135 0.110 0.105

Then you can do the lookup with the following:

testcolnames=c("80","116")

df1[,testcolnames] <- lapply(testcolnames, function(x) {
    tmpcol <- df1[,x]
    defaults <- with(lookup[[x]], {
        default[pmax(pmin(length(breaks)-1, findInterval(df1$rngvalue, breaks)), 1)]
    })
    tmpcol[is.na(tmpcol)] <- defaults[is.na(tmpcol)]
    tmpcol
})

#   rngvalue   80   116
# 1       36 0.03 0.135
# 2   600000 0.03 0.105
# 3      367 5.00 0.110
# 4       90 0.03 6.000
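
The question also asks that columns whose LABMET_ID is missing from the dictionary keep their NA values. The following is a sketch of a base R variant (no plyr) that handles that case. The helper name fill_from_ranges and the extra column "999" are hypothetical and only serve the illustration; the data are the question's example, trimmed to the columns that are used.

```r
# Example data from the question (only the columns used here)
testdefs <- data.frame(
  LABMET_ID = c(80, 116, 116, 116, 116),
  lower     = c(0, 0, 31, 366, 5476),
  upper     = c(54750, 30, 365, 5475, 54750),
  default   = c(0.03, 0.09, 0.135, 0.11, 0.105)
)
df1 <- data.frame(
  rngvalue = c(36, 600000, 367, 90),
  `80`     = c(NA, NA, 5, NA),
  `116`    = c(NA, NA, NA, 6),
  `999`    = c(NA, 1, NA, NA),  # hypothetical column with no dictionary entry
  check.names = FALSE
)

# Hypothetical helper: fill NAs in `col` from the range in `defs` that matches `rng`
fill_from_ranges <- function(col, rng, defs) {
  if (nrow(defs) == 0) return(col)         # ID not in the dictionary: keep the NAs
  defs <- defs[order(defs$lower), ]        # findInterval needs ascending breaks
  breaks <- c(defs$lower, defs$upper[nrow(defs)])
  idx <- findInterval(rng, breaks)
  idx <- pmax(pmin(idx, nrow(defs)), 1)    # out of range: fall back to the nearest range
  missing <- is.na(col)
  col[missing] <- defs$default[idx][missing]
  col
}

testcolnames <- c("80", "116", "999")
df1[testcolnames] <- lapply(testcolnames, function(x) {
  fill_from_ranges(df1[[x]], df1$rngvalue,
                   testdefs[testdefs$LABMET_ID == as.numeric(x), ])
})
df1
```

Columns "80" and "116" end up with the same values as in the output above, while "999" is left untouched.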

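If rngvalue falls outside the range, findInterval returns 0 (below the first break) or the number of breaks (at or above the last break), and neither is a valid index into the defaults. That is why the lookup code clamps the result with pmin and pmax.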
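
The clamping can be seen in isolation in this small sketch, again using the breaks and defaults of LABMET_ID 116; the test values -5 and 600000 are chosen only to fall outside the range.

```r
breaks  <- c(0, 31, 366, 5476, 54750)
default <- c(0.090, 0.135, 0.110, 0.105)

# Raw interval indices: 0 is below the first break, 5 is above the last one
raw <- findInterval(c(-5, 36, 600000), breaks)
raw
# [1] 0 2 5

# Clamp into 1 .. length(breaks) - 1 so that every value maps to a default
idx <- pmax(pmin(raw, length(breaks) - 1), 1)
default[idx]
# [1] 0.090 0.135 0.105
```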