Question

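I have a dataset with 100,000 rows (people) and 500 columns (probabilities). For each row, I want to scan across the columns with a test probability, find the header of the column (a, b or c) whose value is greater than the test value and closest to it, and record that header in a new column.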

For example, using data.table:

library(data.table)

data <- data.table(a    = seq(0.2,  0.55, length.out = 9),
                   b    = seq(0.35, 0.7,  length.out = 9),
                   c    = seq(0.5,  0.85, length.out = 9),
                   test = seq(0.1,  0.9,  length.out = 9))

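The new column would record "a" for the first row (because 0.1 < 0.2), and then a, b, b, b, c, c, c, NA for the next 8 rows. NA is recorded when the test probability is greater than the value in column c.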

       a         b         c    test
 0.20000   0.35000   0.50000     0.1
 0.24375   0.39375   0.54375     0.2
 0.28750   0.43750   0.58750     0.3
 0.33125   0.48125   0.63125     0.4
 0.37500   0.52500   0.67500     0.5
 0.41875   0.56875   0.71875     0.6
 0.46250   0.61250   0.76250     0.7
 0.50625   0.65625   0.80625     0.8
 0.55000   0.70000   0.85000     0.9

I originally had this as a matrix rather than a data.table. The code below does not work, but it gives an idea of what I am trying to do:

Switch <- pmax(as.matrix(data[,a:c])-matrix(rep(test,3), ncol=3, byrow=F),0)  
# subtracts test from columns a,b,c and replaces negative values with 0

FirstSwitch <- Switch[,b:c]>0 & MemSwitch[,a:b]==0
# finds the first non-zero occurrence

MonthSwitchMem <-  apply(FirstSwitch, 1, which.max)
# finds the first column whose value exceeds the test probability

How do I match across columns in a data.table? I think I need a query that uses .SDcols, but I don't know how to write it.

Answer 1

I adapted Karolis's answer so that I could pass my columns from the data.table to the code snippet provided:

library(data.table)

data <- data.frame(a    = seq(0.2,  0.55, length.out = 9),
                   b    = seq(0.35, 0.7,  length.out = 9),
                   c    = seq(0.5,  0.85, length.out = 9),
                   test = seq(0.1,  0.9,  length.out = 9))
data2 <- data.table(data)
id <- c("a", "b", "c")
# For each row, return the name of the first column of x whose value exceeds t
f <- function(x, t) {
  names(x)[apply(sign(x - t), 1, function(vec) match(1, vec))]
}
# Pass the probability columns (.SD) and the test probabilities to f
data2[, f(.SD, data2[, test]), .SDcols = id]
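
Taking the first column above the test value gives the closest one only when the probabilities increase from left to right within each row, as they do in the sample data. If that ordering is not guaranteed, one possible alternative is to compute the gap to the test value for every column and pick the smallest positive gap. The following is a minimal sketch under that assumption, not a tested solution for the full 100,000 x 500 dataset; the names prob_cols, gap, idx, no_match and closest are illustrative and do not come from the original posts.

```r
library(data.table)

data <- data.table(a    = seq(0.2,  0.55, length.out = 9),
                   b    = seq(0.35, 0.7,  length.out = 9),
                   c    = seq(0.5,  0.85, length.out = 9),
                   test = seq(0.1,  0.9,  length.out = 9))

prob_cols <- c("a", "b", "c")  # illustrative name for the probability columns

# Matrix minus a vector of length nrow recycles down the columns,
# so row i of every column is compared with test[i].
gap <- as.matrix(data[, .SD, .SDcols = prob_cols]) - data$test

# Columns that do not exceed the test value can never be selected.
gap[gap <= 0] <- Inf

# Rows where no column exceeds the test value should give NA.
no_match <- rowSums(is.finite(gap)) == 0

# The smallest positive gap per row is the largest value of -gap.
idx <- max.col(-gap, ties.method = "first")
idx[no_match] <- NA

# Add the result as a new column by reference.
data[, closest := prob_cols[idx]]
print(data)
```

Because everything here is a whole-matrix operation, there is no per-row call to an R function, which may matter at the size described in the question.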

Thanks for your help (and for reformatting my question; this is my first post, so apologies for the formatting errors).

Prashant

Answer 2

This works when the data is a matrix (not a data.table).

colnames(data)[apply(sign(data[,1:3] - data[,4]), 1, function(vec){ match(1, vec) })]
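
As a quick check, the one-liner above can be run on the sample values from the question once they are held in a plain numeric matrix. This is only a sketch: the names m, first_above and result are illustrative, and it relies on the probabilities increasing from column a to column c within each row, as they do in the sample.

```r
# Same sample values as in the question, stored as a numeric matrix
m <- cbind(a    = seq(0.2,  0.55, length.out = 9),
           b    = seq(0.35, 0.7,  length.out = 9),
           c    = seq(0.5,  0.85, length.out = 9),
           test = seq(0.1,  0.9,  length.out = 9))

# sign() is 1 where a probability exceeds the test value;
# match(1, vec) returns the position of the first such column, or NA if none.
first_above <- apply(sign(m[, 1:3] - m[, 4]), 1, function(vec) match(1, vec))

result <- colnames(m)[first_above]
print(result)
# "a" "a" "b" "b" "b" "c" "c" "c" NA
```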

R data.frame: match columns and return the column name of the closest match
